V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Junqi Ge1,4*, Ziyi Chen1,4*, Jintao Lin3,4*, Jinguo Zhu4*, Xihui Liu3, Jifeng Dai1,4, Xizhou Zhu1,2,4†

1 Tsinghua University 2 SenseTime Research

3 University of Hong Kong 4 Shanghai AI Laboratory

Abstract

Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios. Empirical studies indicate that VLM performance degrades sharply when the position encoding exceeds the model's context window.

To address this, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens. Our experiments demonstrate that V2PE effectively enhances VLMs' ability to understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to fine-tune the open-source VLM InternVL2-2B. The fine-tuned model achieves strong performance on both standard and long-context multimodal tasks. Notably, when the sequence length of the training dataset is increased to 256K tokens, the model is capable of processing multimodal sequences of up to 1M tokens.



Variable Visual Position Encoding (V2PE)

Figure 3. Illustration of our proposed Variable Visual Position Encoding (V2PE). Unlike the standard position encoding used in most VLMs, which applies the same stepwise positional increment to both visual and textual tokens, V2PE uses smaller and variable positional increments for visual tokens than for textual tokens.
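To make the mechanism concrete, below is a minimal sketch of how such position indices could be assigned. The function name assign_v2pe_positions, the token_types representation, and the specific set of candidate increments are illustrative assumptions, not the paper's actual implementation.

```python
import random

# Hypothetical set of fractional increments for visual tokens; a small,
# variable step keeps long image sequences within a short positional range.
VISUAL_DELTAS = [1.0, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256]

def assign_v2pe_positions(token_types, rng=random):
    """Assign position indices: textual tokens advance by 1,
    visual tokens advance by a smaller (possibly fractional) delta.

    token_types: list of "text" or "image" flags, one per token.
    Returns a list of float position indices of the same length.
    """
    delta = rng.choice(VISUAL_DELTAS)  # one delta shared by the visual tokens
    positions, pos = [], 0.0
    for t in token_types:
        positions.append(pos)
        pos += 1.0 if t == "text" else delta
    return positions

# Example: 3 text tokens, 4 visual tokens, 2 text tokens.
types = ["text"] * 3 + ["image"] * 4 + ["text"] * 2
print(assign_v2pe_positions(types, rng=random.Random(0)))
```

Because the visual increment is smaller than 1, a long run of visual tokens consumes far less of the positional range than it would under standard encoding, which is consistent with the paper's observation that keeping encoded positions inside the context window preserves long-context performance.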

Performance vs. Token Length

Figure 1. Performance of different VLMs on the image retrieval task of MM-NIAH with context lengths of up to 1M tokens.


Performance vs. Positional Increments

Figure 5. Performance on the image retrieval task in MM-NIAH (left) and the QA task in Long-VQA (right) with different positional increments.


Table 4. Comparison with existing MLLMs on general MLLM benchmarks.


Table 5. Comparison with existing MLLMs on long context MLLM benchmarks.

Dataset Summary

Schematic of the components of our constructed training dataset.


Examples of the DocVQA and ChartVQA subsets in our proposed Long-VQA dataset, and of the Image Needle-In-A-Haystack task in our proposed Long-MR dataset.

BibTeX

@misc{ge2024v2peimprovingmultimodallongcontext,
    title={V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding},
    author={Junqi Ge and Ziyi Chen and Jintao Lin and Jinguo Zhu and Xihui Liu and Jifeng Dai and Xizhou Zhu},
    year={2024},
    eprint={2412.09616},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2412.09616},
}
