4DLangVGGT: 4D Language Visual Geometry Grounded Transformer
Xianfeng Wu1, 3, 4# · Yajing Bai1, 3# · Minghan Li2 · Xianzu Wu1, 5 · Xueqi Zhao1, 6 · Zhongyuan Lai1 · Wenyu Liu3 · Xinggang Wang3*
1 State Key Laboratory of Precision Blasting, Jianghan University, 2 Harvard AI and Robotics Lab, Harvard University, 3 School of EIC, Huazhong University of Science and Technology, 4 Department of Computing, The Hong Kong Polytechnic University, 5 Department of Computer Science, Hong Kong Baptist University, 6 School of Mathematics and Statistics, Hubei University of Education, #Equal contribution, * Corresponding Author
This is a PyTorch/GPU implementation of 4DLangVGGT.
4DLangVGGT is a feed-forward framework for language-aware 4D scene understanding, combining StreamVGGT for dynamic geometry reconstruction with a Semantic Bridging Decoder (SBD) that aligns geometry tokens with language semantics. Unlike Gaussian Splatting methods that require per-scene optimization, our feed-forward design can be trained across multiple scenes and directly applied at inference, achieving scalable, efficient, and open-vocabulary 4D semantic fields with state-of-the-art performance on HyperNeRF and Neu3D benchmarks.
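As a rough, hypothetical sketch of this idea (not the repository's actual implementation): the Semantic Bridging Decoder can be viewed as a light head that projects frozen StreamVGGT geometry tokens into a language-embedding space, where they can be scored against open-vocabulary text queries (e.g., CLIP text embeddings). All module names and dimensions below are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticBridgingDecoderSketch(nn.Module):
    # Hypothetical sketch: project geometry tokens into a CLIP-like language space.
    def __init__(self, geo_dim=1024, lang_dim=512, hidden_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(geo_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, lang_dim),
        )

    def forward(self, geometry_tokens, text_embeddings):
        # geometry_tokens: (B, N, geo_dim) tokens from the frozen StreamVGGT encoder
        # text_embeddings: (Q, lang_dim) open-vocabulary query embeddings (e.g. CLIP text)
        sem = F.normalize(self.proj(geometry_tokens), dim=-1)
        txt = F.normalize(text_embeddings, dim=-1)
        return sem @ txt.t()  # (B, N, Q) cosine relevancy between tokens and queries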
4DLangVGGT uses the following software versions:
- Python 3.10
- CUDA 12.4
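Once the installation below is complete, you can optionally confirm these versions from Python (a generic check, not part of the repository):

import torch

print(torch.__version__)          # expected to report 2.4.0 built for cu124
print(torch.version.cuda)         # expected: 12.4
print(torch.cuda.is_available())  # should be True on a CUDA-capable machine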
First, clone 4DLangVGGT with the command below.
git clone https://github.com/hustvl/4DLangVGGT.git --single-branch
cd 4DLangVGGT

Then create a conda environment (Python 3.10) and install the dependencies with the following commands:
# if some system packages are missing
# apt-get update && apt-get install libgl1 ffmpeg libsm6 libxext6 -y
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

4DLangVGGT is trained and evaluated on the HyperNeRF and Neu3D datasets. Please download the datasets and put them in the folder ./data. For data processing, please refer to 4DLangSplat to generate segmentation maps and to extract CLIP and video features.
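The feature extraction itself is handled by the 4DLangSplat tooling; once it has run, a quick way to sanity-check the exported language features is to load a few of them with NumPy. The folder name below follows the --feat_root used in the training command later in this README, and the location and shapes are assumptions:

import numpy as np
from pathlib import Path

feat_dir = Path("data/hypernerf/americano/clip_features-all_dim3")  # assumed location
for f in sorted(feat_dir.glob("*.npy"))[:3]:
    feat = np.load(f)
    print(f.name, feat.shape, feat.dtype)  # shape/dtype depend on the 4DLangSplat export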
Please download the StreamVGGT checkpoint from the official StreamVGGT release and put the checkpoint folder under ./ckpt/streamvggt.
The checkpoint of 4DLangVGGT is available on Hugging Face; please put the checkpoint folder under ./ckpt/4dlangvggt.
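If you prefer fetching the checkpoint from a script, huggingface_hub can download it; the repository id below is a placeholder that you should replace with the actual 4DLangVGGT model id on Hugging Face:

from huggingface_hub import snapshot_download  # pip install huggingface_hub

snapshot_download(
    repo_id="<org>/<4dlangvggt-model>",  # placeholder, use the Hugging Face repo referenced above
    local_dir="./ckpt/4dlangvggt",
)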
Run the following command to generate the demo:
bash scripts/infer.sh

The results will be saved under ./eval/eval_results.
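The exact contents of ./eval/eval_results depend on the inference script; a generic way to list what was written (purely illustrative):

from pathlib import Path

for p in sorted(Path("eval/eval_results").rglob("*")):
    if p.is_file():
        print(p)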
The overall folder structure should be organized as follows:
4DLangVGGT
|-- ckpt
| |-- streamvggt
| | |-- checkpoints.pth
| | |-- model.safetensors
| |-- 4dlangvggt
| | |--
|-- data
| |-- hypernerf
| | |-- americano
| | | |-- annotations
| | | | |-- train
| | | | |-- README
| | | | |-- video_annotations.json
| | | |-- camera
| | | |-- rgb
| | | | |-- 1x
| | | | | |-- 000001.png
| | | | ...
| | | | |-- 2x
| | | | | |-- 000001.png
| | | |-- streamvggt_token
| | | | |-- 000001.npy
| | | ...
| | | |-- dataset.json
| | | |-- metadata.json
| | | |-- points.npy
| | | |-- scene.json
| | | |-- points3D_downsample2.ply
| | |-- chickchicken
| | ...
| |-- neu3d
| | |-- coffee_martini
| | | |-- annotations
| | | | |-- train
| | ...
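Before preprocessing or training, it can help to verify that a scene folder matches this layout. The helper below is a hypothetical convenience script based on the tree above, not part of the repository:

from pathlib import Path

def check_scene(scene_dir: str):
    scene = Path(scene_dir)
    expected = ["annotations", "camera", "rgb", "dataset.json", "metadata.json", "scene.json"]
    for name in expected:
        status = "ok" if (scene / name).exists() else "MISSING"
        print(f"{scene / name}: {status}")
    # streamvggt_token is produced later by the preprocessing step below
    print("streamvggt_token present:", (scene / "streamvggt_token").exists())

check_scene("data/hypernerf/americano")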
To reduce the memory required during training, we first preprocess the videos with StreamVGGT, extract the geometry tokens, and save them in the folder ./data/<dataset>/<class>/streamvggt_token. Taking the americano class from the HyperNeRF dataset as an example, you need to ensure the extracted geometry tokens are in the folder ./data/hypernerf/americano/streamvggt_token.
python preprocess/generate_vggttoken.py \
--categories americano \
--img_root data/hypernerf \
--ckpt ckpt/streamvggt/checkpoints.pth \
--max_num 128 \
--device cuda
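After the script finishes, each frame should have a corresponding .npy file of geometry tokens. A quick, illustrative way to inspect them (the token shape depends on the StreamVGGT configuration):

import numpy as np
from pathlib import Path

token_dir = Path("data/hypernerf/americano/streamvggt_token")
files = sorted(token_dir.glob("*.npy"))
print(f"{len(files)} token files found")
if files:
    tokens = np.load(files[0])
    print(files[0].name, tokens.shape, tokens.dtype)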
We provide the following command for training:

torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 train.py --batch_size 8 \
--data_root YOUR_DATA_ROOT --streamvggt_ckpt_path YOUR_STREAMVGGT_CKPT \
--num_workers 0 --output_dir unify_hyper_clip --mode gt --cos --wandb --joint_train \
--feat_root clip_features-all_dim3

If you find this work useful, please cite:

@article{wu20254dlangvggt,
title={4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer},
author={Wu, Xianfeng and Bai, Yajing and Li, Minghan and Wu, Xianzu and Zhao, Xueqi and Lai, Zhongyuan and Liu, Wenyu and Wang, Xinggang},
journal={arXiv preprint arXiv:2512.05060},
year={2025}
}
Our code is based on several brilliant repositories, including StreamVGGT and 4DLangSplat. Many thanks to these authors!
Released under the MIT License.
