1 Tsinghua University 2 SenseTime Research
3 University of Hong Kong 4 Shanghai AI Laboratory
Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios. Empirical studies indicate that VLM performance degrades sharply when position indices exceed the model's context window.
To address this, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens. Our experiments demonstrate that V2PE enhances VLMs' ability to understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to finetune the open-source VLM InternVL2-2B. The finetuned model achieves strong performance on both standard and long-context multimodal tasks. Notably, when the sequence length of the training dataset is increased to 256K tokens, the model is capable of processing multimodal sequences of up to 1M tokens.
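To make the idea concrete, the sketch below shows one way such variable increments could be assigned to an interleaved sequence of textual and visual tokens. It is a minimal illustration under stated assumptions, not the paper's implementation: the helper name v2pe_position_ids and the example increment of 1/16 are hypothetical, and the actual increment schedule used in V2PE may differ.

# A minimal sketch of V2PE-style position assignment, NOT the authors' code.
# The increment value `delta` (1/16 here) is an illustrative assumption.
import torch

def v2pe_position_ids(is_visual: torch.Tensor, delta: float = 1.0 / 16) -> torch.Tensor:
    """Return (possibly fractional) position ids for a mixed token sequence.

    is_visual: bool tensor of shape (seq_len,), True where the token comes
    from an image. Textual tokens advance the position by 1, while visual
    tokens advance it by the smaller `delta`, so long image sequences consume
    far fewer position indices than they have tokens.
    """
    inc = torch.ones(is_visual.shape, dtype=torch.float32)
    inc[is_visual] = delta
    # p_0 = 0 and p_i = p_{i-1} + inc_i, computed via a cumulative sum.
    return torch.cumsum(inc, dim=0) - inc[0]

# Example: 2 text tokens, 4 visual tokens, 2 text tokens.
mask = torch.tensor([False, False, True, True, True, True, False, False])
print(v2pe_position_ids(mask))
# tensor([0.0000, 1.0000, 1.0625, 1.1250, 1.1875, 1.2500, 2.2500, 3.2500])

Because the visual increments are smaller than 1, a long interleaved sequence occupies far fewer position indices than its token count, which is what keeps the positions within the model's context window.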
Figure 3. Illustration of our proposed Variable Visual Position Encoding (V2PE). Unlike the standard position encoding used in most VLMs, which applies the same stepwise positional increment to both visual and textual tokens, V2PE assigns smaller and variable positional increments to visual tokens.
Figure 1. Performance of different VLMs on the image retrieval task of MM-NIAH with context lengths up to 1M tokens.
Figure 5. Performance on the image retrieval task in MM-NIAH (left) and the QA task in Long-VQA (right) with different positional increments.
Table 4. Comparison with existing MLLMs on general MLLM benchmarks.
Table 5. Comparison with existing MLLMs on long-context MLLM benchmarks.
Schematic of the components of our constructed training dataset.
Examples from the DocVQA and ChartVQA subsets of our proposed Long-VQA dataset, and from the Image Needle In A Haystack subset of our proposed Long-MR dataset.
@misc{ge2024v2peimprovingmultimodallongcontext,
      title={V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding},
      author={Junqi Ge and Ziyi Chen and Jintao Lin and Jinguo Zhu and Xihui Liu and Jifeng Dai and Xizhou Zhu},
      year={2024},
      eprint={2412.09616},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.09616},
}