V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Junqi Ge1,4*, Ziyi Chen1,4*, Jintao Lin3,4*, Jinguo Zhu4*, Xihui Liu3, Jifeng Dai1,4, Xizhou Zhu1,2,4†

1 Tsinghua University 2 SenseTime Research

3 University of Hong Kong 4 Shanghai AI Laboratory

Abstract

Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios. Empirical studies indicate that VLM performance degrades sharply when the position encoding exceeds the model's context window.

To address this, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens. Our experiments demonstrate that V2PE effectively enhances VLMs' ability to understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to fine-tune the open-source VLM InternVL2-2B. The fine-tuned model achieves strong performance on both standard and long-context multimodal tasks. Notably, when the sequence length of the training dataset is increased to 256K tokens, the model is capable of processing multimodal sequences of up to 1M tokens.



Variable Visual Position Encoding (V2PE)

Figure 3. Illustration of our proposed Variable Visual Position Encoding (V2PE). Unlike the standard position encoding used in most VLMs, which applies the same stepwise positional increment to both visual and textual tokens, V2PE uses smaller and variable positional increments for visual tokens than for textual tokens.
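To make the mechanism concrete, below is a minimal sketch of how such position indices could be assigned. The function name assign_v2pe_positions, the token_types representation, and the specific set of candidate increments are illustrative assumptions, not the paper's actual implementation.

```python
import random

# Hypothetical set of fractional increments for visual tokens; a small,
# variable step keeps long image sequences within a short positional range.
VISUAL_DELTAS = [1.0, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256]

def assign_v2pe_positions(token_types, rng=random):
    """Assign position indices: textual tokens advance by 1,
    visual tokens advance by a smaller (possibly fractional) delta.

    token_types: list of "text" or "image" flags, one per token.
    Returns a list of float position indices of the same length.
    """
    delta = rng.choice(VISUAL_DELTAS)  # one delta shared by the visual tokens
    positions, pos = [], 0.0
    for t in token_types:
        positions.append(pos)
        pos += 1.0 if t == "text" else delta
    return positions

# Example: 3 text tokens, 4 visual tokens, 2 text tokens.
types = ["text"] * 3 + ["image"] * 4 + ["text"] * 2
print(assign_v2pe_positions(types, rng=random.Random(0)))
```

Because the visual increment is smaller than 1, a long run of visual tokens consumes far less of the positional range than it would under standard encoding, which is consistent with the paper's observation that keeping encoded positions inside the context window preserves long-context performance.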

Performance vs. Token Length

Figure 1. Performance of different VLMs on the image retrieval task of MM-NIAH with context lengths of up to 1M tokens.


Performance vs. Positional Increments

Figure 5. Performance on the image retrieval task in MM-NIAH (left) and the QA task in Long-VQA (right) with different positional increments.


Table 4. Comparison with existing MLLMs on general MLLM benchmarks.


Table 5. Comparison with existing MLLMs on long context MLLM benchmarks.

Dataset Summary

Schematic of the components of our constructed training dataset.


Examples of the DocVQA and ChartVQA subsets in our proposed Long-VQA dataset, and of the Image Needle-In-A-Haystack task in our proposed Long-MR dataset.

BibTeX

@misc{ge2024v2peimprovingmultimodallongcontext,
    title={V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding},
    author={Junqi Ge and Ziyi Chen and Jintao Lin and Jinguo Zhu and Xihui Liu and Jifeng Dai and Xizhou Zhu},
    year={2024},
    eprint={2412.09616},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2412.09616},
}
