4DLangVGGT: 4D Language Visual Geometry Grounded Transformer
Xianfeng Wu1, 3, 4# · Yajing Bai1, 3# · Minghan Li2 · Xianzu Wu1, 5 · Xueqi Zhao1, 6 · Zhongyuan Lai1 · Wenyu Liu3 · Xinggang Wang3*
1 State Key Laboratory of Precision Blasting, Jianghan University, 2 Harvard AI and Robotics Lab, Harvard University, 3 School of EIC, Huazhong University of Science and Technology, 4 Department of Computing, The Hong Kong Polytechnic University, 5 Department of Computer Science, Hong Kong Baptist University, 6 School of Mathematics and Statistics, Hubei University of Education, #Equal contribution, * Corresponding Author
This is a PyTorch/GPU implementation of 4DLangVGGT.
4DLangVGGT is a feed-forward framework for language-aware 4D scene understanding, combining StreamVGGT for dynamic geometry reconstruction with a Semantic Bridging Decoder (SBD) that aligns geometry tokens with language semantics. Unlike Gaussian Splatting methods that require per-scene optimization, our feed-forward design can be trained across multiple scenes and directly applied at inference, achieving scalable, efficient, and open-vocabulary 4D semantic fields with state-of-the-art performance on HyperNeRF and Neu3D benchmarks.
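As a rough, hypothetical sketch of this idea (not the repository's actual implementation): the Semantic Bridging Decoder can be viewed as a light head that projects frozen StreamVGGT geometry tokens into a language-embedding space, where they can be scored against open-vocabulary text queries (e.g., CLIP text embeddings). All module names and dimensions below are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticBridgingDecoderSketch(nn.Module):
    # Hypothetical sketch: project geometry tokens into a CLIP-like language space.
    def __init__(self, geo_dim=1024, lang_dim=512, hidden_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(geo_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, lang_dim),
        )

    def forward(self, geometry_tokens, text_embeddings):
        # geometry_tokens: (B, N, geo_dim) tokens from the frozen StreamVGGT encoder
        # text_embeddings: (Q, lang_dim) open-vocabulary query embeddings (e.g. CLIP text)
        sem = F.normalize(self.proj(geometry_tokens), dim=-1)
        txt = F.normalize(text_embeddings, dim=-1)
        return sem @ txt.t()  # (B, N, Q) cosine relevancy between tokens and queries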
4DLangVGGT uses the following software versions:
- Python 3.10
- CUDA 12.4
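Once the installation below is complete, you can optionally confirm these versions from Python (a generic check, not part of the repository):

import torch

print(torch.__version__)          # expected to report 2.4.0 built for cu124
print(torch.version.cuda)         # expected: 12.4
print(torch.cuda.is_available())  # should be True on a CUDA-capable machine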
First, clone 4DLangVGGT with the command below.
git clone https://github.com/hustvl/4DLangVGGT.git --single-branch
cd 4DLangVGGT

Then create a conda environment (Python 3.10) and install the dependencies with the following commands:
# if some system packages are missing
# apt-get update && apt-get install libgl1 ffmpeg libsm6 libxext6 -y
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

4DLangVGGT is trained and evaluated on the HyperNeRF and Neu3D datasets. Please download the datasets and put them in the folder ./data. For data processing, please refer to 4DLangSplat to generate segmentation maps and to extract CLIP and video features.
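The feature extraction itself is handled by the 4DLangSplat tooling; once it has run, a quick way to sanity-check the exported language features is to load a few of them with NumPy. The folder name below follows the --feat_root used in the training command later in this README, and the location and shapes are assumptions:

import numpy as np
from pathlib import Path

feat_dir = Path("data/hypernerf/americano/clip_features-all_dim3")  # assumed location
for f in sorted(feat_dir.glob("*.npy"))[:3]:
    feat = np.load(f)
    print(f.name, feat.shape, feat.dtype)  # shape/dtype depend on the 4DLangSplat export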
Please download the StreamVGGT checkpoint from the official StreamVGGT release and put the checkpoint folder under ./ckpt/streamvggt.
The checkpoint of 4DLangVGGT is available on Hugging Face; please put the checkpoint folder under ./ckpt/4dlangvggt.
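If you prefer fetching the checkpoint from a script, huggingface_hub can download it; the repository id below is a placeholder that you should replace with the actual 4DLangVGGT model id on Hugging Face:

from huggingface_hub import snapshot_download  # pip install huggingface_hub

snapshot_download(
    repo_id="<org>/<4dlangvggt-model>",  # placeholder, use the Hugging Face repo referenced above
    local_dir="./ckpt/4dlangvggt",
)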
Run the following command to generate the demo:
bash scripts/infer.sh

The results will be saved under ./eval/eval_results.
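The exact contents of ./eval/eval_results depend on the inference script; a generic way to list what was written (purely illustrative):

from pathlib import Path

for p in sorted(Path("eval/eval_results").rglob("*")):
    if p.is_file():
        print(p)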
The overall folder structure should be organized as follows:
4DLangVGGT
|-- ckpt
| |-- streamvggt
| | |-- checkpoints.pth
| | |-- model.safetensors
| |-- 4dlangvggt
| | |--
|-- data
| |-- hypernerf
| | |-- americano
| | | |-- annotations
| | | | |-- train
| | | | |-- README
| | | | |-- video_annotations.json
| | | |-- camera
| | | |-- rgb
| | | | |-- 1x
| | | | | |-- 000001.png
| | | | ...
| | | | |-- 2x
| | | | | |-- 000001.png
| | | |-- streamvggt_token
| | | | |-- 000001.npy
| | | ...
| | | |-- dataset.json
| | | |-- metadata.json
| | | |-- points.npy
| | | |-- scene.json
| | | |-- points3D_downsample2.ply
| | |-- chickchicken
| | ...
| |-- neu3d
| | |-- coffee_martini
| | | |-- annotations
| | | | |-- train
| | ...
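Before preprocessing or training, it can help to verify that a scene folder matches this layout. The helper below is a hypothetical convenience script based on the tree above, not part of the repository:

from pathlib import Path

def check_scene(scene_dir: str):
    scene = Path(scene_dir)
    expected = ["annotations", "camera", "rgb", "dataset.json", "metadata.json", "scene.json"]
    for name in expected:
        status = "ok" if (scene / name).exists() else "MISSING"
        print(f"{scene / name}: {status}")
    # streamvggt_token is produced later by the preprocessing step below
    print("streamvggt_token present:", (scene / "streamvggt_token").exists())

check_scene("data/hypernerf/americano")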
To reduce the memory required during training, we first preprocess the videos with StreamVGGT, extract the geometry tokens, and save them in the folder ./data/<dataset>/<class>/streamvggt_token. Taking the americano class from the HyperNeRF dataset as an example, you need to ensure the extracted geometry tokens are in the folder ./data/hypernerf/americano/streamvggt_token.
python preprocess/generate_vggttoken.py \
--categories americano \
--img_root data/hypernerf \
--ckpt ckpt/streamvggt/checkpoints.pth \
--max_num 128 \
--device cuda
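After the script finishes, each frame should have a corresponding .npy file of geometry tokens. A quick, illustrative way to inspect them (the token shape depends on the StreamVGGT configuration):

import numpy as np
from pathlib import Path

token_dir = Path("data/hypernerf/americano/streamvggt_token")
files = sorted(token_dir.glob("*.npy"))
print(f"{len(files)} token files found")
if files:
    tokens = np.load(files[0])
    print(files[0].name, tokens.shape, tokens.dtype)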
We provide the following command for training:

torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 train.py --batch_size 8 \
--data_root YOUR_DATA_ROOT --streamvggt_ckpt_path YOUR_STREAMVGGT_CKPT \
--num_workers 0 --output_dir unify_hyper_clip --mode gt --cos --wandb --joint_train \
--feat_root clip_features-all_dim3

If you find this work useful, please cite:

@article{wu20254dlangvggt,
title={4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer},
author={Wu, Xianfeng and Bai, Yajing and Li, Minghan and Wu, Xianzu and Zhao, Xueqi and Lai, Zhongyuan and Liu, Wenyu and Wang, Xinggang},
journal={arXiv preprint arXiv:2512.05060},
year={2025}
}
Our code is based on several brilliant repositories, including StreamVGGT and 4DLangSplat. Many thanks to these authors!
Released under the MIT License.
