hustvl/4DLangVGGT

4DLangVGGT: 4D Language Visual Geometry Grounded Transformer
Official PyTorch Implementation

Xianfeng Wu1, 3, 4# · Yajing Bai1, 3# · Minghan Li2 · Xianzu Wu1, 5 · Xueqi Zhao1, 6 · Zhongyuan Lai1 · Wenyu Liu3 · Xinggang Wang3*

1 State Key Laboratory of Precision Blasting, Jianghan University, 2 Harvard AI and Robotics Lab, Harvard University, 3 School of EIC, Huazhong University of Science and Technology, 4 Department of Computing, The Hong Kong Polytechnic University, 5 Department of Computer Science, Hong Kong Baptist University, 6 School of Mathematics and Statistics, Hubei University of Education, #Equal contribution, * Corresponding Author

This is a PyTorch/GPU implementation of 4DLangVGGT

Overview

4DLangVGGT is a feed-forward framework for language-aware 4D scene understanding, combining StreamVGGT for dynamic geometry reconstruction with a Semantic Bridging Decoder (SBD) that aligns geometry tokens with language semantics. Unlike Gaussian Splatting methods that require per-scene optimization, our feed-forward design can be trained across multiple scenes and directly applied at inference, achieving scalable, efficient, and open-vocabulary 4D semantic fields with state-of-the-art performance on HyperNeRF and Neu3D benchmarks.

Installation

4DLangVGGT uses the following software versions:

  • Python 3.10
  • CUDA 12.4

First, clone the 4DLangVGGT repository:

git clone https://github.com/hustvl/4DLangVGGT.git --single-branch
cd 4DLangVGGT

Then set up the environment (a conda environment with Python 3.10 is assumed) and install the dependencies:

# If system libraries are missing, install them first:
# apt-get update && apt-get install -y libgl1 ffmpeg libsm6 libxext6

pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124

pip install -r requirements.txt
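
After installation, a quick sanity check can confirm that the environment matches the versions listed above. The script below is a minimal sketch and not part of the repository; the names `check_versions` and `main` are hypothetical. It only compares the Python, PyTorch, and CUDA versions against the pins from this section.

```python
import sys

# Versions pinned in the installation instructions above.
EXPECTED_PYTHON = (3, 10)
EXPECTED_TORCH = "2.4.0"
EXPECTED_CUDA = "12.4"


def check_versions(python_version, torch_version, cuda_version):
    """Return a list of human-readable mismatches (empty if everything matches)."""
    problems = []
    if tuple(python_version[:2]) != EXPECTED_PYTHON:
        found = f"{python_version[0]}.{python_version[1]}"
        problems.append(f"Python {found} found, expected 3.10")
    # torch.__version__ may carry a local suffix such as "2.4.0+cu124".
    if torch_version.split("+")[0] != EXPECTED_TORCH:
        problems.append(f"torch {torch_version} found, expected {EXPECTED_TORCH}")
    # torch.version.cuda is None for CPU-only builds.
    if cuda_version != EXPECTED_CUDA:
        problems.append(f"CUDA {cuda_version} found, expected {EXPECTED_CUDA}")
    return problems


def main():
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return
    problems = check_versions(sys.version_info, torch.__version__, torch.version.cuda)
    for line in problems or ["Environment matches the pinned versions."]:
        print(line)
    print("CUDA device available:", torch.cuda.is_available())


if __name__ == "__main__":
    main()
```

A mismatch is not necessarily fatal; the check only reports where the environment differs from the versions stated in this README.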

Dataset

4DLangVGGT is trained and evaluated on the HyperNeRF and Neu3D datasets. Please download the datasets and place them in the folder ./data. For data processing, please refer to 4DLangSplat to generate the segmentation maps and extract the CLIP and video features.

QuickStart

Download Checkpoints

Please download the StreamVGGT checkpoint from here and place the checkpoint folder under ./ckpt/streamvggt.

The 4DLangVGGT checkpoint is available on Hugging Face; place the checkpoint folder under ./ckpt/4dlangvggt.

Inference

Run the following command to generate the demo:

bash scripts/infer.sh

The results will be saved under ./eval/eval_results.

Folder Structure

The overall folder structure should be organized as follows:

4DLangVGGT
|-- ckpt
|   |-- streamvggt
|   |   |-- checkpoints.pth
|   |   |-- model.safetensors
|   |-- 4dlangvggt
|   |   |-- 
|-- data
|   |-- hypernerf
|   |   |-- americano
|   |   |   |-- annotations
|   |   |   |   |-- train
|   |   |   |   |-- README
|   |   |   |   |-- video_annotations.json
|   |   |   |-- camera
|   |   |   |-- rgb
|   |   |   |   |-- 1x
|   |   |   |   |   |-- 000001.png
|   |   |   |   ...
|   |   |   |   |-- 2x
|   |   |   |   |   |-- 000001.png
|   |   |   |-- streamvggt_token
|   |   |   |   |-- 000001.npy
|   |   |   ...
|   |   |   |-- dataset.json
|   |   |   |-- metadata.json
|   |   |   |-- points.npy
|   |   |   |-- scene.json
|   |   |   |-- points3D_downsample2.ply
|   |   |-- chickchicken
|   |   ...
|   |-- neu3d
|   |   |-- coffee_martini
|   |   |   |-- annotations
|   |   |   |   |-- train
|   |   ...
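
Before training, it can help to verify that each scene folder follows the layout above. The following is a minimal sketch, not a script shipped with the repository: the function name `missing_entries` is hypothetical, and it checks only a subset of the entries shown in the tree for HyperNeRF scenes. Note that streamvggt_token is produced by Step 1 of training, so it is expected to be missing before that step has run.

```python
from pathlib import Path

# Subset of the entries listed for a HyperNeRF scene in the tree above.
SCENE_DIRS = ("annotations", "camera", "rgb", "streamvggt_token")
SCENE_FILES = ("dataset.json", "metadata.json", "scene.json")


def missing_entries(scene_dir):
    """Return the names of expected sub-folders and files absent from scene_dir."""
    scene = Path(scene_dir)
    missing = [name for name in SCENE_DIRS if not (scene / name).is_dir()]
    missing += [name for name in SCENE_FILES if not (scene / name).is_file()]
    return missing


if __name__ == "__main__":
    root = Path("data/hypernerf")
    if not root.is_dir():
        print(f"{root} does not exist; download the dataset first.")
    else:
        # Report every scene folder found under data/hypernerf.
        for scene in sorted(p for p in root.iterdir() if p.is_dir()):
            missing = missing_entries(scene)
            status = "ok" if not missing else "missing: " + ", ".join(missing)
            print(f"{scene.name}: {status}")
```

Run it from the repository root so that the relative path data/hypernerf resolves correctly.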

Training

Step 1: Generate Geometry Tokens

To reduce the amount of memory required during training, we first preprocess the video with StreamVGGT, extract the geometry tokens, and save them in the folder ./data/<dataset>/<class>/streamvggt_token. For example, for the americano class from the HyperNeRF dataset, the extracted geometry tokens must end up in the folder ./data/hypernerf/americano/streamvggt_token.

python preprocess/generate_vggttoken.py \
    --categories americano \
    --img_root data/hypernerf \
    --ckpt ckpt/streamvggt/checkpoints.pth \
    --max_num 128 \
    --device cuda
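
To preprocess several scenes in a row, the command above can be wrapped in a small driver. This is a sketch under assumptions: the helper names `build_preprocess_cmd` and `run_all` are hypothetical, and it invokes the script once per scene because this README does not say whether --categories accepts several values at once. Only the flags shown in the command above are used.

```python
import subprocess
import sys


def build_preprocess_cmd(category, img_root="data/hypernerf",
                         ckpt="ckpt/streamvggt/checkpoints.pth",
                         max_num=128, device="cuda"):
    """Build the argument list for one scene, mirroring the command above."""
    return [
        sys.executable, "preprocess/generate_vggttoken.py",
        "--categories", category,
        "--img_root", img_root,
        "--ckpt", ckpt,
        "--max_num", str(max_num),
        "--device", device,
    ]


def run_all(categories, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for category in categories:
        cmd = build_preprocess_cmd(category)
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the loop at the first failing scene.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Scene names taken from the folder structure section.
    run_all(["americano", "chickchicken"])
```

By default the driver only prints the commands; call `run_all(..., dry_run=False)` from the repository root to actually launch the preprocessing.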

Step 2: Train 4DLangVGGT

We provide the following command for training:

torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 train.py --batch_size 8 \
    --data_root YOUR_DATA_ROOT --streamvggt_ckpt_path YOUR_STREAMVGGT_CKPT \
    --num_workers 0 --output_dir unify_hyper_clip --mode gt --cos --wandb --joint_train \
    --feat_root clip_features-all_dim3

Cite

@article{wu20254dlangvggt,
  title={4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer},
  author={Wu, Xianfeng and Bai, Yajing and Li, Minghan and Wu, Xianzu and Zhao, Xueqi and Lai, Zhongyuan and Liu, Wenyu and Wang, Xinggang},
  journal={arXiv preprint arXiv:2512.05060},
  year={2025}
}

Acknowledgements

Our code is based on the following brilliant repositories:

Many thanks to these authors!

License

Released under the MIT License.
