The Devil is in Temporal Token: High Quality Video Reasoning Segmentation

Dalian University of Technology, Harvard University
CVPR 2025

*Corresponding Author

In-the-wild Video Demo

An item used for timekeeping in daily life and offering aesthetic appeal, commonly worn on the wrist and often featuring a circular or rectangular design, occasionally appearing in the video.

A white object with a circular smooth surface, typically placed on a table and unmoved by the videographer, designed to hold and present various types of food, offering functionality in dining settings.

A prominent figure subtly highlighted within a commercial promotional video, whose presence and actions serve as the central point for engagement and communication.

Who displays the most dynamic and expressive range of movement during the dance, transitioning seamlessly between sharp, high-energy motions and captivating attention with their vibrant and energetic performance?

Abstract

Existing methods for Video Reasoning Segmentation rely heavily on a single special token to represent the object in the keyframe or the entire video, inadequately capturing spatial complexity and inter-frame motion. To overcome these challenges, we propose VRS-HQ, an end-to-end video reasoning segmentation approach that leverages Multimodal Large Language Models (MLLMs) to inject rich spatiotemporal features into hierarchical tokens. Our key innovations include a Temporal Dynamic Aggregation (TDA) and a Token-driven Keyframe Selection (TKS). Specifically, we design frame-level <SEG> and temporal-level <TAK> tokens that utilize MLLM's autoregressive learning to effectively capture both local and global information. Subsequently, we apply a similarity-based weighted fusion and frame selection strategy, then utilize SAM2 to perform keyframe segmentation and propagation. To enhance keyframe localization accuracy, the TKS filters keyframes based on SAM2's occlusion scores during inference. VRS-HQ achieves state-of-the-art performance on ReVOS, surpassing VISA by 5.9%/12.5%/9.1% in J&F scores across the three subsets. These results highlight the strong temporal reasoning and segmentation capabilities of our method.

Comparison with Previous Methods


Comparison with previous VRS approaches. (a) Previous methods utilize a single <SEG> token for keyframe-based segmentation, depending heavily on external models for keyframe detection and mask propagation. This reliance can hinder accurate keyframe localization and prevent end-to-end inference. (b) In contrast, VRS-HQ introduces frame-level <SEG> tokens and a temporal <TAK> token for dynamic aggregation. The aggregated <TAK> token is then used for both keyframe selection and mask generation within SAM2. This enables single-stage inference with precise keyframe selection and high-quality segmentation. (c) VRS-HQ achieves state-of-the-art performance on various image and video datasets across reasoning and referring segmentation.

Overall Framework of VRS-HQ


(a) VRS-HQ architecture. VRS-HQ incorporates a Multimodal Large Language Model for Temporal Token Encoding (<SEG> and <TAK> tokens, §3.1), a Temporal Dynamic Aggregation, a Token-driven Keyframe Selection, and Mask Decoding and Propagation. (b) Temporal Dynamic Aggregation (TDA) merges frame-level <SEG> tokens into a temporal <TAK> token using a weighted fusion based on cosine similarity (§3.2). (c) Token-driven Keyframe Selection (TKS). During training, the frame with the <SEG> token closest to the <TAK> token is selected as the keyframe. During inference, keyframe selection is refined using SAM2's occlusion scores and token similarity scores (§3.3). (d) Mask Decoding and Propagation (MDP). The <TAK> token provides a sparse embedding for SAM2, generating a keyframe mask and propagating it to other frames via a memory mechanism (§3.4).
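The caption above describes TDA and TKS only at a high level. As a rough illustration, the two steps can be sketched as below; this is a minimal sketch under assumed details (the mean-token anchor, softmax weighting, and all function names are ours for illustration, not the paper's exact formulation):

```python
import numpy as np

def temporal_dynamic_aggregation(seg_tokens):
    """Sketch of TDA: fuse frame-level <SEG> tokens (shape T x D) into one
    temporal <TAK> token, weighting each frame by its cosine similarity to a
    reference token (here, the mean token; the paper's anchor may differ)."""
    seg = np.asarray(seg_tokens, dtype=np.float64)          # (T, D)
    anchor = seg.mean(axis=0)                               # reference token
    sims = seg @ anchor / (
        np.linalg.norm(seg, axis=1) * np.linalg.norm(anchor) + 1e-8
    )                                                       # cosine similarity per frame
    weights = np.exp(sims) / np.exp(sims).sum()             # softmax over frames
    tak = weights @ seg                                     # weighted fusion -> <TAK>
    return tak

def select_keyframe(seg_tokens, tak):
    """Sketch of the training-time TKS rule from the caption: pick the frame
    whose <SEG> token is most similar to the aggregated <TAK> token."""
    seg = np.asarray(seg_tokens, dtype=np.float64)
    sims = seg @ tak / (
        np.linalg.norm(seg, axis=1) * np.linalg.norm(tak) + 1e-8
    )
    return int(np.argmax(sims))
```

At inference, per the caption, this similarity-based choice would additionally be filtered by SAM2's occlusion scores, which is omitted here.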

Qualitative Comparison of VRS-HQ and VISA in Various Scenarios on the ReVOS Benchmark

BibTeX

@article{gong2025devil,
  title={The Devil is in Temporal Token: High Quality Video Reasoning Segmentation},
  author={Gong, Sitong and Zhuge, Yunzhi and Zhang, Lu and Yang, Zongxin and Zhang, Pingping and Lu, Huchuan},
  journal={arXiv preprint arXiv:2501.08549},
  year={2025}
}