MarkTechPost@AI December 17, 2024
Meta AI Releases Apollo: A New Family of Video Large Multimodal Models (Video-LMMs) for Video Understanding

Meta AI, in collaboration with Stanford University, has introduced the Apollo family of video multimodal models, designed to advance video understanding. Through techniques such as fps-based frame sampling, dual vision encoders, and staged training, the Apollo models can effectively process videos up to an hour long and perform strongly across a range of video-language tasks. The family includes 1.5B, 3B, and 7B parameter versions, delivering high performance under a variety of computational constraints. The release also includes ApolloBench, a benchmark suite for more efficient evaluation of video models. Apollo marks an important step in the development of video multimodal models, providing a powerful tool for applications such as video analysis and content understanding.

⏱️ FPS sampling: Apollo uses frame-per-second sampling rather than uniform sampling, better capturing motion and temporal information while preserving a consistent temporal flow.

📈 Scaling consistency: Design choices validated on smaller models transfer effectively to larger ones, reducing the cost of large-scale experimentation while maintaining performance.

👁️‍🗨️ Dual vision encoders: Combining SigLIP's spatial understanding with InternVideo2's temporal reasoning yields a more comprehensive representation of video data.

🔀 Token resampling: A Perceiver Resampler efficiently reduces the number of video tokens, enabling long-video processing with lower computational overhead and without losing information.

📊 ApolloBench: A benchmark suite designed specifically for video multimodal models that reduces evaluation redundancy and provides more detailed performance insights.

While large multimodal models (LMMs) have advanced significantly for text and image tasks, video-based models remain underdeveloped. Videos are inherently complex, combining spatial and temporal dimensions that place greater demands on computational resources. Existing methods often adapt image-based approaches directly or rely on uniform frame sampling, which poorly captures motion and temporal patterns. Moreover, training large-scale video models is computationally expensive, making it difficult to explore design choices efficiently.

To tackle these issues, researchers from Meta AI and Stanford developed Apollo, a family of video-focused LMMs designed to push the boundaries of video understanding. Apollo addresses these challenges through deliberate design decisions that improve efficiency and set a new benchmark for tasks like temporal reasoning and video-based question answering.

Meta AI Introduces Apollo: A Family of Scalable Video-LMMs

Meta AI’s Apollo models are designed to process videos up to an hour long while achieving strong performance across key video-language tasks. Apollo comes in three sizes – 1.5B, 3B, and 7B parameters – offering flexibility to accommodate various computational constraints and real-world needs.


Technical Highlights and Advantages

The Apollo models are built on a series of well-researched design choices aimed at overcoming the challenges of video-based LMMs:

Frame-Per-Second Sampling: Unlike uniform frame sampling, fps sampling maintains a consistent temporal flow, allowing Apollo to better understand motion, speed, and the sequence of events in videos (see the sampling sketch after this list).

Scaling Consistency: Experiments show that design choices made on moderately sized models (2B-4B parameters) generalize well to larger models. This approach reduces computational costs while maintaining performance gains.

Dual Vision Encoders: Apollo uses two complementary encoders: SigLIP, which excels at spatial understanding, and InternVideo2, which enhances temporal reasoning. Their combined strengths produce more accurate video representations (a fusion sketch follows below).

Token Resampling: By using a Perceiver Resampler, Apollo efficiently reduces video tokens without losing information, allowing the models to process long videos without excessive computational overhead (see the resampler sketch below).

Optimized Training: Apollo employs a three-stage training process in which video encoders are first fine-tuned on video data before being integrated with text and image datasets. This staged approach ensures stable and effective learning.

Multi-Turn Conversations: Apollo models support interactive, multi-turn conversations grounded in video content, making them well suited to applications like video-based chat systems and content analysis.
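
To make the sampling distinction concrete, here is a minimal sketch contrasting uniform sampling with fps-based sampling. The function names, signatures, and the fixed 2 fps target rate are illustrative assumptions, not Apollo's actual implementation.

```python
# Illustrative sketch: uniform vs. fps-based frame selection.
# Function names and parameters are hypothetical, not Apollo's API.

def uniform_sample(total_frames: int, num_frames: int) -> list[int]:
    """Pick num_frames indices spread evenly across the whole video.
    A 10-second clip and a 10-minute clip get the same number of frames,
    so the effective temporal resolution varies with duration."""
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]

def fps_sample(total_frames: int, native_fps: float,
               target_fps: float = 2.0) -> list[int]:
    """Pick frames at a fixed rate (here 2 frames per second of video),
    so the spacing between sampled frames is constant regardless of
    video length, preserving motion and event ordering."""
    stride = native_fps / target_fps
    return [int(i * stride) for i in range(int(total_frames / stride))]

# A 60s video at 30 fps: uniform sampling of 8 frames spaces them ~7.5s
# apart; fps sampling at 2 fps yields 120 frames spaced 0.5s apart.
print(len(uniform_sample(1800, 8)), len(fps_sample(1800, 30.0)))
```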
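
The dual-encoder design can be sketched as encoding the same video patches twice and fusing the features. The stand-in linear encoders and the channel-wise concatenation below are assumptions for illustration; the real SigLIP and InternVideo2 encoders are large pretrained transformers.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Hypothetical sketch of combining a spatial encoder (SigLIP-like)
    with a temporal encoder (InternVideo2-like). Both modules here are
    stand-in projections, not the actual pretrained models."""

    def __init__(self, spatial_dim=1152, temporal_dim=768, out_dim=1024):
        super().__init__()
        patch_pixels = 3 * 16 * 16  # flattened 16x16 RGB patch
        self.spatial_encoder = nn.Linear(patch_pixels, spatial_dim)
        self.temporal_encoder = nn.Linear(patch_pixels, temporal_dim)
        self.proj = nn.Linear(spatial_dim + temporal_dim, out_dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_tokens, patch_pixels)
        spatial = self.spatial_encoder(patches)    # per-frame appearance features
        temporal = self.temporal_encoder(patches)  # motion/temporal features
        # Channel-wise concatenation, then projection to the LMM token width.
        return self.proj(torch.cat([spatial, temporal], dim=-1))

tokens = DualEncoder()(torch.randn(1, 256, 3 * 16 * 16))
print(tokens.shape)  # torch.Size([1, 256, 1024])
```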
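
Token resampling with a Perceiver-style module can be sketched as a small set of learned latent queries cross-attending to the full video token sequence, so the language model only ever sees a fixed, short sequence. All dimensions and the single-block structure are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Minimal sketch of Perceiver-style token resampling: learned latent
    queries cross-attend to a long sequence of video tokens, compressing
    it to a fixed-length output. Dimensions are illustrative."""

    def __init__(self, dim=1024, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, num_video_tokens, dim), potentially thousands long
        b = video_tokens.shape[0]
        queries = self.latents.unsqueeze(0).expand(b, -1, -1)
        # Latents attend to all video tokens; output length = num_latents.
        out, _ = self.attn(queries, video_tokens, video_tokens)
        return out + self.ff(out)

# 8,000 video tokens are compressed to 64 tokens for the language model.
compressed = PerceiverResampler()(torch.randn(1, 8000, 1024))
print(compressed.shape)  # torch.Size([1, 64, 1024])
```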

Performance Insights

Apollo’s capabilities are validated through strong results on multiple benchmarks, often outperforming larger models:

Apollo-1.5B: Surpasses models like Phi-3.5-Vision (4.2B) and LongVA-7B, scoring 60.8 on Video-MME, 63.3 on MLVU, and 57.0 on ApolloBench.

Apollo-3B: Competes with and outperforms many 7B models, scoring 58.4 on Video-MME, 68.7 on MLVU, 62.7 on ApolloBench, and 55.1 on LongVideoBench.

Apollo-7B: Matches and even surpasses models with over 30B parameters, such as Oryx-34B and VILA1.5-40B, scoring 61.2 on Video-MME, 70.9 on MLVU, and 66.3 on ApolloBench.


Conclusion

Apollo marks a significant step forward in video-LMM development. By addressing key challenges such as efficient video sampling and model scalability, Apollo provides a practical and powerful solution for understanding video content. Its ability to outperform larger models highlights the importance of well-researched design and training strategies.

The Apollo family offers practical solutions for real-world applications, from video-based question answering to content analysis and interactive systems. Importantly, Meta AI’s introduction of ApolloBench provides a more streamlined and effective benchmark for evaluating video-LMMs, paving the way for future research.


Check out the Paper, Website, Demo, Code, and Models. All credit for this research goes to the researchers of this project.

