A Study of Temporal Awareness in VLMs

cs.AI updates on arXiv.org, October 23, 12:21

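This paper studies the temporal awareness of large-scale vision-language models (VLMs). It introduces the TIME10k dataset, evaluates the temporal awareness of 37 VLMs, and proposes methods for deriving an explicit "timeline" representation from the embedding space, which supports computationally efficient temporal reasoning.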

arXiv:2510.19559v1 Announce Type: cross

Abstract: Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs using a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit "timeline" representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.
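The abstract does not say how the timeline is actually derived, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes synthetic stand-in embeddings whose year information lies on a curved arc in a 2-D plane of a 64-dimensional space, reduces them with PCA, and builds a chronologically ordered chain of per-period centroids. The names `synthetic_embeddings`, `Timeline`, and `run_demo`, the 1900-2000 year range, and all parameter values are hypothetical choices made for this example.

```python
import numpy as np


def synthetic_embeddings(years, dim=64, noise=0.01, seed=0):
    """Hypothetical stand-in for VLM image embeddings (NOT real CLIP output).

    Years are placed on a quarter-circle arc inside a random 2-D plane, so
    time follows a low-dimensional, non-linear curve in the embedding space.
    """
    rng = np.random.default_rng(seed)
    t = (years - 1900.0) / 100.0                        # normalise to [0, 1]
    basis, _ = np.linalg.qr(rng.normal(size=(dim, 2)))  # orthonormal 2-D plane
    arc = np.stack([np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)], axis=1)
    emb = arc @ basis.T + noise * rng.normal(size=(len(years), dim))
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)


class Timeline:
    """Assumed approach: PCA projection plus one centroid per time bin."""

    def __init__(self, n_components=2, n_bins=20):
        self.n_components = n_components
        self.n_bins = n_bins

    def _project(self, emb):
        return (emb - self.mean_) @ self.components_.T

    def fit(self, emb, years):
        # PCA via SVD of the centred embeddings.
        self.mean_ = emb.mean(axis=0)
        _, _, vt = np.linalg.svd(emb - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        z = self._project(emb)
        # Group training images into equal-width year bins.
        edges = np.linspace(years.min(), years.max(), self.n_bins + 1)
        bins = np.clip(np.digitize(years, edges) - 1, 0, self.n_bins - 1)
        keep = [b for b in range(self.n_bins) if np.any(bins == b)]
        # Timeline nodes: bin centroids in chronological order.
        self.nodes_ = np.stack([z[bins == b].mean(axis=0) for b in keep])
        self.node_years_ = np.array([years[bins == b].mean() for b in keep])
        return self

    def predict(self, emb):
        # Assign each embedding the mean year of its nearest timeline node.
        z = self._project(emb)
        dist = np.linalg.norm(z[:, None, :] - self.nodes_[None, :, :], axis=2)
        return self.node_years_[dist.argmin(axis=1)]


def run_demo(n=2000, n_train=1500, seed=1):
    rng = np.random.default_rng(seed)
    years = rng.integers(1900, 2001, size=n).astype(float)
    emb = synthetic_embeddings(years)
    model = Timeline().fit(emb[:n_train], years[:n_train])
    pred = model.predict(emb[n_train:])
    mae = float(np.mean(np.abs(pred - years[n_train:])))
    return mae, model


if __name__ == "__main__":
    mae, _ = run_demo()
    print(f"held-out mean absolute error: {mae:.2f} years")
```

Once the nodes are fitted, dating an image needs only one projection and a nearest-node lookup, which is one plausible reason a timeline could be cheaper than scoring an image against many text prompts. Whether the paper's methods work this way would have to be checked against its code and data.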
