MarkTechPost@AI · Nov 08, 01:42
Spatial Supersensing: A New Frontier for Long-Video Multimodal AI

Researchers have proposed Cambrian-S, a family of spatially grounded video multimodal large language models designed to address the challenges current AI models face when processing long videos. Unlike conventional approaches that rely on massive compute and ever-larger context windows, Cambrian-S focuses on predicting what comes next and selectively remembering key events. The team also released the VSI Super benchmark and the VSI 590K dataset to test and train spatial sensing in long videos. By predicting the latent representation of the next frame and using "surprise" to manage memory dynamically, the model substantially improves observation, recall, and counting over long videos, marking a shift in AI from passive understanding to active prediction.

🎯 **Spatial supersensing is the key challenge for long-video multimodal AI**: Existing AI models struggle to track objects and maintain counts over long, complex video streams, and they perform poorly even with substantial compute and larger context windows. The next competitive edge will come from models that predict the future and selectively remember surprising, important events, rather than from compute scale alone.

🧠 **Cambrian-S and the VSI Super benchmark**: The research team introduced the Cambrian-S model family, focused on spatial sensing in long videos, together with the VSI Super benchmark (comprising the VSR and VSC tests) and the VSI 590K dataset. Through observation, recall, and counting tasks over long indoor videos, VSI Super exposes the limitations of current models in continual spatial sensing and shows that simply enlarging the context window is not a solution.

💡 **Predictive sensing and surprise-driven memory**: Cambrian-S introduces a "predictive sensing" mechanism that predicts the latent representation of the next frame and uses the prediction error (the "surprise") to dynamically compress and store video frames. High-surprise frames are retained in detail while low-surprise frames are compressed, which keeps GPU memory usage stable and significantly improves performance on long videos, outperforming existing models on the VSR and VSC tasks in particular.

📈 **A hierarchy of spatial supersensing capabilities**: The research defines spatial supersensing as a progression of capabilities beyond purely linguistic reasoning, spanning semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling. Cambrian-S targets these higher levels, requiring the model to remember spatial layouts across time, reason about object locations and counts, and anticipate changes in the 3D world.

Even strong "long-context" AI models fail badly when they must track objects and counts over long, messy video streams. The next competitive edge will therefore come from models that predict what comes next and selectively remember only surprising, important events, not from more compute and bigger context windows. A team of researchers from New York University and Stanford introduces Cambrian-S, a family of spatially grounded video multimodal large language models, together with the VSI Super benchmark and the VSI 590K dataset to test and train spatial supersensing in long videos.

https://arxiv.org/pdf/2511.04670

From video question answering to spatial supersensing

The research team frames spatial supersensing as a progression of capabilities beyond linguistic-only reasoning. The stages are semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling.

Most current video MLLMs sample sparse frames and rely on language priors. They often answer benchmark questions using captions or single frames rather than continuous visual evidence. Diagnostic tests show that several popular video benchmarks are solvable with limited or text-only input, so they do not strongly test spatial sensing.

Cambrian-S targets the higher stages of this hierarchy, where the model must remember spatial layouts across time, reason about object locations and counts, and anticipate changes in a 3D world.

VSI Super, a stress test for continual spatial sensing

To expose the gap between current systems and spatial supersensing, the research team designed VSI Super, a two-part benchmark that runs on arbitrarily long indoor videos.


VSI Super Recall, or VSR, evaluates long-horizon spatial observation and recall. Human annotators take indoor walkthrough videos from ScanNet, ScanNet++ and ARKitScenes and use Gemini to insert an unusual object, such as a teddy bear, into four frames at different spatial locations. These edited sequences are concatenated into streams of up to 240 minutes. The model must report the order of the locations where the object appears, which makes this a visual needle-in-a-haystack task with sequential recall.


VSI Super Count, or VSC, measures continual counting under changing viewpoints and rooms. The benchmark concatenates room-tour clips from VSI Bench and asks for the total number of instances of a target object across all rooms. The model must handle viewpoint changes, revisits, and scene transitions while maintaining a cumulative count. Evaluation uses mean relative accuracy for durations from 10 to 120 minutes.
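Mean relative accuracy rewards counts that are close to the ground truth rather than demanding exact matches. The snippet below is a minimal sketch of this metric in the style used by VSI-Bench-family benchmarks; the exact threshold sweep is an assumption here, not taken from the paper.

```python
import numpy as np

def mean_relative_accuracy(pred: float, target: float,
                           thresholds=np.arange(0.50, 1.00, 0.05)) -> float:
    """Average, over a sweep of confidence thresholds theta, of the indicator
    that the relative error |pred - target| / target is below 1 - theta.
    The threshold values here are an illustrative assumption."""
    if target == 0:
        return float(pred == 0)
    rel_err = abs(pred - target) / target
    return float(np.mean([rel_err < (1.0 - t) for t in thresholds]))

# Example: predicting 9 instances when the true cumulative count is 10
# gives a relative error of 0.1, which passes 8 of the 10 thresholds.
print(mean_relative_accuracy(9, 10))  # 0.8
```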

When Cambrian-S 7B is evaluated on VSI Super in a streaming setup at 1 frame per second, accuracy on VSR drops from 38.3 percent at 10 minutes to 6.0 percent at 60 minutes and becomes zero beyond 60 minutes. VSC accuracy is near zero across lengths. Gemini 2.5 Flash also degrades on VSI Super despite a long context window, which shows that brute force context scaling is not sufficient for continual spatial sensing.

VSI 590K, spatially focused instruction data

To test whether data scaling can help, the research team constructs VSI 590K, a spatial instruction corpus with 5,963 videos, 44,858 images, and 590,667 question-answer pairs from 10 sources.

Sources include 3D-annotated real indoor scans such as ScanNet, ScanNet++ V2, ARKitScenes, S3DIS, and Aria Digital Twin; simulated scenes from ProcTHOR and Hypersim; and pseudo-annotated web data such as YouTube RoomTour and the robot datasets Open X-Embodiment and AgiBot World.

The dataset defines 12 spatial question types, such as object count, absolute and relative distance, object size, room size, and appearance order. Questions are generated from 3D annotations or reconstructions, so spatial relationships are grounded in geometry rather than text heuristics. Ablations show that annotated real videos contribute the largest gains on VSI Bench, followed by simulated data and then pseudo-annotated images, and that training on the full mix gives the best spatial performance.


Cambrian-S model family and spatial performance

Cambrian-S builds on Cambrian-1 and uses Qwen2.5 language backbones at 0.5B, 1.5B, 3B, and 7B parameters with a SigLIP2 SO400M vision encoder and a two-layer MLP connector.

Training follows a four-stage pipeline. Stage 1 performs vision-language alignment on image-text pairs. Stage 2 applies image instruction tuning, equivalent to the improved Cambrian-1 setup. Stage 3 extends to video with general video instruction tuning on a 3-million-sample mixture called Cambrian-S 3M. Stage 4 performs spatial video instruction tuning on a mixture of VSI 590K and a subset of the stage 3 data.


On VSI Bench, Cambrian-S 7B reaches 67.5 percent accuracy and outperforms open-source baselines such as InternVL3.5 8B and Qwen2.5-VL 7B, as well as the proprietary Gemini 2.5 Pro, by more than 16 absolute points. The model also maintains strong performance on Perception Test, EgoSchema, and other general video benchmarks, so the focus on spatial sensing does not compromise general capabilities.

Predictive sensing with latent frame prediction and surprise

To go beyond static context expansion, the research team proposes predictive sensing. They add a Latent Frame Prediction head, a two-layer MLP that predicts the latent representation of the next video frame in parallel with next-token prediction.

Training modifies stage 4. The model uses mean squared error and cosine-distance losses between the predicted and ground-truth latent features, weighted against the language-modeling loss. A subset of 290,000 videos from VSI 590K, sampled at 1 frame per second, is reserved for this objective. During this stage the connector, language model, and both output heads are trained jointly, while the SigLIP vision encoder remains frozen.
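As a rough illustration, the latent frame prediction objective can be thought of as a small regression head trained alongside the usual language-modeling loss. The PyTorch sketch below is an assumption-laden reconstruction, not the authors' code: the hidden size, activation, and loss weights are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentFramePredictionHead(nn.Module):
    """Two-layer MLP that maps the model's hidden state at a frame position
    to a prediction of the next frame's visual latent (illustrative sketch)."""
    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, frame_hidden_states: torch.Tensor) -> torch.Tensor:
        return self.mlp(frame_hidden_states)

def total_loss(pred_latents, next_frame_latents, lm_loss, lfp_weight=0.1):
    """Combine MSE and cosine-distance regression losses on the next-frame
    latents with the language-modeling loss; the weighting is an assumed value."""
    mse = F.mse_loss(pred_latents, next_frame_latents)
    cos = (1.0 - F.cosine_similarity(pred_latents, next_frame_latents, dim=-1)).mean()
    return lm_loss + lfp_weight * (mse + cos)
```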


At inference time, the cosine distance between predicted and actual features becomes a surprise score. Frames with low surprise are compressed before being stored in long-term memory, and high-surprise frames are retained in more detail. A fixed-size memory buffer uses surprise to decide which frames to consolidate or drop, and queries retrieve the frames most relevant to the question.
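A minimal sketch of how such a surprise score and a fixed-size memory could fit together is shown below. The compression rule (average pooling), the surprise threshold, and the eviction policy are assumptions made for illustration, not the paper's exact consolidation scheme.

```python
import torch
import torch.nn.functional as F

def surprise_score(predicted_latent: torch.Tensor, actual_latent: torch.Tensor) -> float:
    """Surprise = cosine distance between predicted and observed frame latents."""
    return 1.0 - F.cosine_similarity(predicted_latent, actual_latent, dim=-1).mean().item()

class SurpriseMemory:
    """Fixed-size long-term memory (sketch): low-surprise frames are stored in
    compressed, pooled form; high-surprise frames keep full detail; the least
    surprising entry is dropped when the buffer exceeds its capacity."""
    def __init__(self, capacity: int = 512, threshold: float = 0.3):
        self.capacity, self.threshold = capacity, threshold
        self.entries = []  # list of (surprise, frame_features)

    def add(self, frame_features: torch.Tensor, surprise: float):
        if surprise < self.threshold:
            # Compress an unsurprising frame by average-pooling its token grid.
            frame_features = frame_features.mean(dim=0, keepdim=True)
        self.entries.append((surprise, frame_features))
        if len(self.entries) > self.capacity:
            # Evict the least surprising frame to stay within the memory budget.
            idx = min(range(len(self.entries)), key=lambda i: self.entries[i][0])
            self.entries.pop(idx)
```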


For VSR, this surprise-driven memory system lets Cambrian-S maintain accuracy as video length increases while keeping GPU memory usage stable. It outperforms Gemini 1.5 Flash and Gemini 2.5 Flash on VSR at all tested durations and avoids the sharp degradation seen in models that only extend context.

For VSC, the research team designed a surprise-driven event segmentation scheme. The model accumulates features in an event buffer, and when a high-surprise frame signals a scene change, it summarizes that buffer into a segment-level answer and resets the buffer. Aggregating the segment answers gives the final count. In streaming evaluation, Gemini Live and GPT Realtime achieve less than 15 percent mean relative accuracy and drop near zero on 120-minute streams, while Cambrian-S with surprise segmentation reaches about 38 percent at 10 minutes and maintains around 28 percent at 120 minutes.
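The segment-and-aggregate loop can be sketched roughly as follows; `surprise_fn` and `count_segment_fn` are hypothetical placeholders for the model's surprise computation and its per-segment counting query, and the scene-change threshold is an assumed value.

```python
def count_with_surprise_segmentation(frames, surprise_fn, count_segment_fn,
                                     scene_change_threshold=0.5):
    """Stream frames, start a new event segment whenever surprise spikes
    (interpreted as a room or scene change), answer the counting question
    per segment, and sum the segment answers into a running total (sketch)."""
    total, buffer = 0, []
    for frame in frames:
        if buffer and surprise_fn(frame) > scene_change_threshold:
            total += count_segment_fn(buffer)  # summarize the finished segment
            buffer = []                        # reset the event buffer
        buffer.append(frame)
    if buffer:
        total += count_segment_fn(buffer)      # flush the final segment
    return total
```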

Key Takeaways

- Cambrian-S and VSI 590K show that careful spatial data design and strong video MLLMs can significantly improve spatial cognition on VSI Bench, but they still fail on VSI Super, so scale alone does not solve spatial supersensing.
- VSI Super, through VSR and VSC, is intentionally built from arbitrarily long indoor videos to stress continual spatial observation, recall, and counting, which makes it resistant to brute-force context window expansion and standard sparse frame sampling.
- Benchmarking shows that frontier models, including Gemini 2.5 Flash and Cambrian-S, degrade sharply on VSI Super even when video lengths remain within their nominal context limits, revealing a structural weakness in current long-context multimodal architectures.
- The Latent Frame Prediction based predictive sensing module uses the next-latent-frame prediction error, or surprise, to drive memory compression and event segmentation, which yields substantial gains on VSI Super compared to long-context baselines while keeping GPU memory usage stable.
- The research positions spatial supersensing as a hierarchy from semantic perception to predictive world modeling and argues that future video MLLMs must incorporate explicit predictive objectives and surprise-driven memory, not only larger models and datasets, to handle unbounded streaming video in real applications.

Editorial Comments

This work is a useful stress test of current video MLLMs because it shows that VSI Super is not just a harder benchmark: it exposes a structural failure of long-context architectures that still rely on reactive perception. The predictive sensing module, based on Latent Frame Prediction and surprise-driven memory, is an important step because it couples spatial sensing with internal world modeling rather than only scaling data and parameters. This research signals a shift from passive video understanding to predictive spatial supersensing as the next design target for multimodal models.



