InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Hongyuan Tao1, Bencheng Liao1, Shaoyu Chen2, Haoran Yin2, Qian Zhang2, Wenyu Liu1, Xinggang Wang1,✉️
1Huazhong University of Science and Technology, 2Horizon Robotics
(✉️) corresponding author: [email protected]
InfiniteVL is a novel linear-complexity Vision-Language Model (VLM) architecture designed to overcome the computational bottlenecks of traditional Transformers in processing unlimited multimodal streams.
By synergizing Sliding Window Attention (SWA) for fine-grained local perception and Gated DeltaNet for efficient long-term memory, InfiniteVL achieves a "best of both worlds" balance. It delivers competitive performance on standard benchmarks (comparable to Qwen2.5-VL) while enabling constant-memory inference and high-throughput streaming.
- 🚀 High Efficiency: Achieves a >3.6× inference speedup and a constant memory footprint compared to FlashAttention-2-accelerated Transformers.
- ⚡ Real-Time Streaming: Sustains a stable 24 FPS prefill speed on a single NVIDIA RTX 4090 for continuous video understanding.
- 🧠 Unlimited Context: Effectively retains context over extremely long sequences (tested beyond 500K tokens) without OOM errors.
- 🏆 Strong Performance: Matches leading Transformer-based VLMs (e.g., Qwen2.5-VL-3B) and significantly outperforms previous linear VLMs (e.g., VL-Mamba, Cobra) across a comprehensive range of benchmarks.
- Dec. 10th, 2025: We release the InfiniteVL model weights and inference code! Please check the Model Zoo.
- Dec. 10th, 2025: We release our paper on arXiv.
We recommend using Anaconda or Miniconda to manage the environment. The code is tested on Python 3.11 + PyTorch 2.6.0 + CUDA 12.1.
1. Clone the repository:

   ```bash
   git clone https://github.com/hustvl/InfiniteVL.git
   cd InfiniteVL
   ```

2. Create and activate a virtual environment:

   ```bash
   conda create -n infinitevl python=3.11 -y
   conda activate infinitevl
   ```

3. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

- Introduction
- Getting started
- Architecture
- Training Strategy
- Performance & Main Results
- Model Zoo
- Advanced Usage (Streaming)
- Qualitative Analysis & Visualization
- Citation
- Acknowledgement
InfiniteVL adopts a hybrid architecture that synergizes the efficiency of linear attention with the precision of window-based attention. The model comprises a Vision Encoder (adapted from Qwen2.5-VL), a Projection MLP, and a Decoder-only LLM Backbone.
- Hybrid Block Design: The LLM backbone consists of 9 Hybrid Blocks (see the sketch after this list). Within each block, we strategically interleave:
  - 1 Sliding Window Attention (SWA) Layer: Responsible for capturing high-resolution local context and fine-grained visual details.
  - 3 Gated DeltaNet Layers: Responsible for modeling long-range global dependencies with linear complexity.
- Constant Memory Footprint: Unlike traditional Transformers, where the Key-Value (KV) cache grows linearly with sequence length ($O(N)$), the Gated DeltaNet layers compress history into a fixed-size memory state (e.g., $16 \times 128 \times 256$). This enables constant memory usage and constant inference latency, even when processing unlimited input streams.
- Seamless Integration: By combining SWA and Gated DeltaNet, InfiniteVL achieves the "best of both worlds":
  - Local attention ensures high performance on information-intensive tasks (e.g., OCR, Document Understanding).
  - Linear attention ensures efficiency and stability for long-context scenarios (e.g., Streaming Video Understanding).
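For intuition, here is a minimal PyTorch-style sketch of the 1:3 interleaving described above. The module names, the ordering of SWA versus Gated DeltaNet inside a block, and the tensor sizes are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

# Identity stubs standing in for the real layers; the actual SWA kernels and the
# Gated DeltaNet recurrence (with its fixed-size memory state) are far more involved.
class SlidingWindowAttention(nn.Module):
    def forward(self, x):
        return x  # attends only within a local window -> fine-grained local detail

class GatedDeltaNetLayer(nn.Module):
    def forward(self, x):
        return x  # compresses history into a constant-size state -> linear complexity

class HybridBlock(nn.Module):
    """One hybrid block: 1 SWA layer interleaved with 3 Gated DeltaNet layers."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(
            [SlidingWindowAttention()] + [GatedDeltaNetLayer() for _ in range(3)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)  # residual connections and norms omitted for brevity
        return x

# The decoder backbone stacks 9 such hybrid blocks (9 blocks x 4 layers = 36 layers).
backbone = nn.Sequential(*[HybridBlock() for _ in range(9)])
print(backbone(torch.randn(1, 16, 2048)).shape)  # works for any sequence length
```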
To achieve strong multimodal performance with minimal training resources, InfiniteVL employs a three-stage progressive training strategy. This approach allows our linear-complexity model to inherit the vast knowledge of a Transformer teacher before adapting to long-context scenarios.
- Goal: Rapidly transfer knowledge from the Qwen2.5-VL teacher to the InfiniteVL student.
- Method: We replace the teacher's attention layers with Gated DeltaNet while keeping other parameters frozen. We use a Layer-wise MSE Loss (to align internal states) and an End-to-End KL Divergence (to align output logits); a rough sketch of this combined objective follows the list.
- Significance: This bypasses the difficulty of training linear attention from scratch, ensuring a robust initialization.
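To make the Stage-1 objective concrete, the sketch below combines a per-layer MSE term on hidden states with a KL term on output logits. The loss weighting, temperature, and tensor layout are assumptions for illustration, not the exact recipe used in training:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_hiddens, teacher_hiddens, student_logits, teacher_logits,
                      kl_weight=1.0, temperature=1.0):
    """Layer-wise MSE on hidden states + end-to-end KL on logits (weights are assumed)."""
    # Align each student layer's hidden state with the frozen Transformer teacher's.
    mse = sum(F.mse_loss(s, t) for s, t in zip(student_hiddens, teacher_hiddens))
    mse = mse / len(student_hiddens)

    # Match the teacher's output distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    return mse + kl_weight * kl

# Example with dummy tensors (4 layers, batch 2, seq 8, hidden 32, vocab 100):
hs = [torch.randn(2, 8, 32) for _ in range(4)]
ht = [h + 0.01 * torch.randn_like(h) for h in hs]
loss = distillation_loss(hs, ht, torch.randn(2, 8, 100), torch.randn(2, 8, 100))
print(loss)
```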
- Goal: Unlock strong instruction-following and reasoning capabilities.
- Data: ~8M diverse multimodal instruction pairs, covering General VQA, OCR, Mathematics, and Code.
- Settings: Image resolution increased to 1344×1344; max context length set to 8,192.
- Outcome: Produces the Stage 2 Model, which offers the best performance on standard benchmarks.
- Goal: Activate the architecture's potential for unlimited-length processing and streaming.
- Data: A mixture of Stage 2 data (800K) and ~200K long-sequence samples (e.g., long videos, multi-page documents).
- Method: LoRA fine-tuning with the context length extended to 32,768 (a configuration sketch follows this list).
- Outcome: Produces the Stage 3 Model, enabling length extrapolation and stable streaming inference.
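The sketch below shows what such a setup could look like with the Hugging Face peft library. The checkpoint path, model class, LoRA rank, and target module names are placeholder assumptions rather than the hyperparameters used in our training:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the Stage-2 checkpoint as the starting point (path and model class are assumed).
base_model = AutoModelForCausalLM.from_pretrained(
    "/path/to/InfiniteVL-4B", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Hypothetical LoRA hyperparameters; the actual Stage-3 recipe may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)  # only the adapter weights are trainable
model.print_trainable_parameters()
# Training then proceeds with the context length extended to 32,768 tokens.
```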
InfiniteVL is engineered for unlimited-input scenarios. Unlike Transformer-based models where cost grows linearly with history, InfiniteVL maintains constant computational cost and memory usage.
Hardware Setup: All efficiency results are measured on a single NVIDIA RTX 4090 GPU.
Figure 1: Comparison of streaming FPS and latency. InfiniteVL sustains real-time performance while Transformer baselines degrade rapidly.
InfiniteVL achieves state-of-the-art performance among linear-complexity VLMs. Crucially, thanks to our Hybrid Architecture and High-quality training strategies, it overcomes the traditional weakness of linear models in information-intensive tasks (e.g., OCR, Document Understanding), achieving results comparable to top-tier Transformer VLMs.
Figure 2: Comparison of InfiniteVL with existing VLMs on public multimodal understanding, real-world comprehension, text-rich, and reasoning-centric multimodal benchmarks.
Key Takeaways:
- Best-in-Class Linear Model: Significantly outperforms previous linear VLMs (Cobra, MaTVLM) by large margins (+40-60 points on DocVQA/OCRBench).
- Transformer-Level Quality: Matches the performance of Qwen2.5-VL-3B on complex reasoning and text-rich tasks while being significantly faster in long contexts.
We release two versions of InfiniteVL-4B to cater to different application scenarios.
| Model | Stage | Description | Training Context Length | Download |
|---|---|---|---|---|
| InfiniteVL-4B | Stage 2 | Best Generalist / Base. The checkpoint directly after Instruction SFT. It delivers the peak foundational performance on standard multimodal benchmarks (e.g., OCR, MMMU, MathVista) and preserves the most robust knowledge. | 8K | 🤗 Hugging Face |
| InfiniteVL-4B-LongSFT | Stage 3 | Long-Context Adapted. Fine-tuned using only a small amount of long-sequence multimodal data. It successfully activates length generalization for streaming scenarios, though its full potential on extreme contexts is not yet fully exploited. | 32K | 🤗 Hugging Face |
💡 Recommendations:
- For Long-Context Inference: Please use the Stage 3 model. It enables stable streaming inference and avoids memory explosion.
- For Training / Fine-tuning: We strongly recommend using the Stage 2 model as your starting point. Since it maintains the strongest general capabilities and hasn't shifted towards the specific long-context distribution, it serves as the best foundation for adaptation to new tasks or domains.
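If you just want to try one of the checkpoints above, the snippet below is a minimal loading sketch. It assumes a standard Hugging Face transformers interface with `trust_remote_code=True`; the model/processor classes, prompt format, file paths, and generation arguments are illustrative assumptions and may differ from the released implementation:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Placeholder path/repo id; pick the Stage-2 or Stage-3 checkpoint as recommended above.
model_id = "/path/to/InfiniteVL-4B"

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Single-image VQA example (image path and prompt are placeholders).
image = Image.open("assets/example.png")
inputs = processor(text="Describe this image.", images=image, return_tensors="pt").to("cuda")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```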
Unlike Transformer-based VLMs where the KV cache grows dynamically, InfiniteVL maintains a constant-size memory state. This unique property allows us to use CUDA Graphs to capture the entire computation graph for both streaming prefill and decoding, eliminating kernel launch overheads and maximizing GPU utilization.
This is the key technology behind our 24 FPS real-time streaming performance.
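To make the idea concrete, here is a generic PyTorch CUDA Graphs sketch, not InfiniteVL's actual streaming kernel: a placeholder fixed-shape step is captured once and then replayed for every incoming chunk, with new data copied into pre-allocated static buffers.

```python
import torch

# Generic CUDA Graphs pattern (requires an NVIDIA GPU). `step_fn` is a placeholder
# for InfiniteVL's fixed-shape streaming step, not the real implementation.
device = "cuda"
model = torch.nn.Linear(256, 256).to(device)            # stand-in for the real step
static_input = torch.zeros(1, 16, 256, device=device)   # pre-allocated static buffers
static_state = torch.zeros(1, 256, device=device)       # constant-size "memory state"

def step_fn(x, state):
    # Placeholder computation; InfiniteVL would update its linear memory state here.
    return model(x), state + x.mean(dim=1)

# Warm up on a side stream (required before capture), then capture one step.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        step_fn(static_input, static_state)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output, static_new_state = step_fn(static_input, static_state)

# Streaming loop: copy each new chunk into the static buffers and replay the graph,
# paying no per-kernel launch overhead no matter how long the stream runs.
for _ in range(10):
    static_input.copy_(torch.randn_like(static_input))
    graph.replay()
    static_state.copy_(static_new_state)  # carry the recurrent state forward
    result = static_output.clone()
```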
We provide a complete script in examples/demo_streaming_inference.py to demonstrate this capability.
🎥 Simulation Note: This script simulates a real-time streaming scenario by reading a local video file frame-by-frame. It treats the video as a continuous data stream, updating the global linear memory state on-the-fly without retraining.
⚠️ Requirement: This demo relies on the specialized model implementation (supporting `StaticCachePrealloc` and CUDA Graphs) located in the `infinitevl/infinitevl_streaming` directory. Please ensure your environment is set up correctly to import these modules.
```bash
# Make sure you are in the project root
python examples/demo_streaming_inference.py \
    --model_path /path/to/InfiniteVL-4B \
    --video_path assets/demo.mp4 \
    --fps 30
```

In addition to streaming prefill, InfiniteVL natively supports CUDA Graph-accelerated decoding. By capturing the decoding step into a static graph, we can achieve extremely low-latency token generation, further enhancing the responsiveness of real-time interactions.
🚧 Coming Soon: The code for accelerated decoding is currently being refactored and cleaned up. We are working hard to release it as soon as possible. Please stay tuned!
We provide visualization cases to demonstrate InfiniteVL's robust performance across diverse scenarios, ranging from information-intensive static tasks to ultra-long streaming video understanding.
InfiniteVL effectively overcomes the traditional limitations of linear attention in detailed visual perception. By combining Sliding Window Attention with Gated DeltaNet, it excels at Dense Text Recognition (OCR), Chart Interpretation, and Complex Scene Description, delivering performance comparable to full-attention Transformers.
The core strength of InfiniteVL lies in its ability to maintain coherent memory over unlimited input streams.
The examples below demonstrate a continuous street-view video stream. InfiniteVL maintains a constant memory state and accurately answers questions at various timestamps (e.g., Frame 3100, ~1M tokens processed), recalling specific details like "NBC Studios" text or the color of a pedestrian's bag without forgetting.
If you have any questions, please contact Hongyuan Tao via email ([email protected]).
If you find InfiniteVL useful for your research or applications, please consider citing our paper:
```bibtex
@article{tao2025infinitevl,
  title={InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models},
  author={Tao, Hongyuan and Liao, Bencheng and Chen, Shaoyu and Yin, Haoran and Zhang, Qian and Liu, Wenyu and Wang, Xinggang},
  journal={arXiv preprint},
  year={2025}
}
```

InfiniteVL builds on the shoulders of giants in the open-source community. We would like to express our gratitude to:
- Qwen2.5-VL: For providing a powerful vision-language codebase and vision encoder.
- Gated DeltaNet: For the efficient linear attention mechanism and CUDA kernel implementations (FLA).
- Open-Source Datasets: We sincerely thank the creators of the high-quality datasets used in our training, including FineVision, LLaVA-OneVision, PixMo, The Cauldron, Docmatix, LLaVA-Video, and others. Their contributions are essential to the development of efficient multimodal models.