MarkTechPost@AI December 17, 2024
Meta AI Releases Apollo: A New Family of Video Large Multimodal Models (Video-LMMs) for Video Understanding

Meta AI, in collaboration with Stanford University, has introduced the Apollo family of video multimodal models, designed to advance video understanding. Through techniques such as fps-based frame sampling, dual vision encoders, and staged training, the Apollo models can effectively process videos up to an hour long and perform strongly across a range of video-language tasks. The family includes 1.5B, 3B, and 7B parameter versions, delivering high performance under a variety of computational constraints. The release also includes ApolloBench, a benchmark suite for more efficient evaluation of video models. Apollo marks an important step in the development of video multimodal models, providing a powerful tool for applications such as video analysis and content understanding.

⏱️ FPS sampling: Apollo uses frame-per-second sampling rather than uniform sampling, better capturing motion and temporal information while preserving a consistent temporal flow.

📈 Scaling consistency: Design choices validated on smaller models transfer effectively to larger ones, reducing the cost of large-scale experimentation while maintaining performance.

👁️‍🗨️ Dual vision encoders: Combining SigLIP's spatial understanding with InternVideo2's temporal reasoning yields a more comprehensive representation of video data.

🔀 Token resampling: A Perceiver Resampler efficiently reduces the number of video tokens, enabling long-video processing with lower computational overhead and without losing information.

📊 ApolloBench: A benchmark suite designed specifically for video multimodal models that reduces evaluation redundancy and provides more detailed performance insights.

While large multimodal models (LMMs) have advanced significantly for text and image tasks, video-based models remain underdeveloped. Videos are inherently complex, combining spatial and temporal dimensions that place greater demands on computational resources. Existing methods often adapt image-based approaches directly or rely on uniform frame sampling, which poorly captures motion and temporal patterns. Moreover, training large-scale video models is computationally expensive, making it difficult to explore design choices efficiently.

To tackle these issues, researchers from Meta AI and Stanford developed Apollo, a family of video-focused LMMs designed to push the boundaries of video understanding. Apollo addresses these challenges through deliberate design decisions that improve efficiency and set a new benchmark for tasks like temporal reasoning and video-based question answering.

Meta AI Introduces Apollo: A Family of Scalable Video-LMMs

Meta AI’s Apollo models are designed to process videos up to an hour long while achieving strong performance across key video-language tasks. Apollo comes in three sizes – 1.5B, 3B, and 7B parameters – offering flexibility to accommodate various computational constraints and real-world needs.


Technical Highlights and Advantages

The Apollo models are built on a series of well-researched design choices aimed at overcoming the challenges of video-based LMMs:

Frame-Per-Second Sampling: Unlike uniform frame sampling, fps sampling maintains a consistent temporal flow, allowing Apollo to better understand motion, speed, and the sequence of events in videos (see the sampling sketch after this list).

Scaling Consistency: Experiments show that design choices made on moderately sized models (2B-4B parameters) generalize well to larger models. This approach reduces computational costs while maintaining performance gains.

Dual Vision Encoders: Apollo uses two complementary encoders: SigLIP, which excels at spatial understanding, and InternVideo2, which enhances temporal reasoning. Their combined strengths produce more accurate video representations (a fusion sketch follows below).

Token Resampling: By using a Perceiver Resampler, Apollo efficiently reduces video tokens without losing information, allowing the models to process long videos without excessive computational overhead (see the resampler sketch below).

Optimized Training: Apollo employs a three-stage training process in which video encoders are first fine-tuned on video data before being integrated with text and image datasets. This staged approach ensures stable and effective learning.

Multi-Turn Conversations: Apollo models support interactive, multi-turn conversations grounded in video content, making them well suited to applications like video-based chat systems and content analysis.
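
To make the sampling distinction concrete, here is a minimal sketch contrasting uniform sampling with fps-based sampling. The function names, signatures, and the fixed 2 fps target rate are illustrative assumptions, not Apollo's actual implementation.

```python
# Illustrative sketch: uniform vs. fps-based frame selection.
# Function names and parameters are hypothetical, not Apollo's API.

def uniform_sample(total_frames: int, num_frames: int) -> list[int]:
    """Pick num_frames indices spread evenly across the whole video.
    A 10-second clip and a 10-minute clip get the same number of frames,
    so the effective temporal resolution varies with duration."""
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]

def fps_sample(total_frames: int, native_fps: float,
               target_fps: float = 2.0) -> list[int]:
    """Pick frames at a fixed rate (here 2 frames per second of video),
    so the spacing between sampled frames is constant regardless of
    video length, preserving motion and event ordering."""
    stride = native_fps / target_fps
    return [int(i * stride) for i in range(int(total_frames / stride))]

# A 60s video at 30 fps: uniform sampling of 8 frames spaces them ~7.5s
# apart; fps sampling at 2 fps yields 120 frames spaced 0.5s apart.
print(len(uniform_sample(1800, 8)), len(fps_sample(1800, 30.0)))
```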
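
The dual-encoder design can be sketched as encoding the same video patches twice and fusing the features. The stand-in linear encoders and the channel-wise concatenation below are assumptions for illustration; the real SigLIP and InternVideo2 encoders are large pretrained transformers.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Hypothetical sketch of combining a spatial encoder (SigLIP-like)
    with a temporal encoder (InternVideo2-like). Both modules here are
    stand-in projections, not the actual pretrained models."""

    def __init__(self, spatial_dim=1152, temporal_dim=768, out_dim=1024):
        super().__init__()
        patch_pixels = 3 * 16 * 16  # flattened 16x16 RGB patch
        self.spatial_encoder = nn.Linear(patch_pixels, spatial_dim)
        self.temporal_encoder = nn.Linear(patch_pixels, temporal_dim)
        self.proj = nn.Linear(spatial_dim + temporal_dim, out_dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_tokens, patch_pixels)
        spatial = self.spatial_encoder(patches)    # per-frame appearance features
        temporal = self.temporal_encoder(patches)  # motion/temporal features
        # Channel-wise concatenation, then projection to the LMM token width.
        return self.proj(torch.cat([spatial, temporal], dim=-1))

tokens = DualEncoder()(torch.randn(1, 256, 3 * 16 * 16))
print(tokens.shape)  # torch.Size([1, 256, 1024])
```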
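
Token resampling with a Perceiver-style module can be sketched as a small set of learned latent queries cross-attending to the full video token sequence, so the language model only ever sees a fixed, short sequence. All dimensions and the single-block structure are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Minimal sketch of Perceiver-style token resampling: learned latent
    queries cross-attend to a long sequence of video tokens, compressing
    it to a fixed-length output. Dimensions are illustrative."""

    def __init__(self, dim=1024, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, num_video_tokens, dim), potentially thousands long
        b = video_tokens.shape[0]
        queries = self.latents.unsqueeze(0).expand(b, -1, -1)
        # Latents attend to all video tokens; output length = num_latents.
        out, _ = self.attn(queries, video_tokens, video_tokens)
        return out + self.ff(out)

# 8,000 video tokens are compressed to 64 tokens for the language model.
compressed = PerceiverResampler()(torch.randn(1, 8000, 1024))
print(compressed.shape)  # torch.Size([1, 64, 1024])
```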

Performance Insights

Apollo’s capabilities are validated through strong results on multiple benchmarks, often outperforming larger models:

Apollo-1.5B: Surpasses models like Phi-3.5-Vision (4.2B) and LongVA-7B, scoring 60.8 on Video-MME, 63.3 on MLVU, and 57.0 on ApolloBench.

Apollo-3B: Competes with and outperforms many 7B models, scoring 58.4 on Video-MME, 68.7 on MLVU, 62.7 on ApolloBench, and 55.1 on LongVideoBench.

Apollo-7B: Matches and even surpasses models with over 30B parameters, such as Oryx-34B and VILA1.5-40B, scoring 61.2 on Video-MME, 70.9 on MLVU, and 66.3 on ApolloBench.


Conclusion

Apollo marks a significant step forward in video-LMM development. By addressing key challenges such as efficient video sampling and model scalability, Apollo provides a practical and powerful solution for understanding video content. Its ability to outperform larger models highlights the importance of well-researched design and training strategies.

The Apollo family offers practical solutions for real-world applications, from video-based question answering to content analysis and interactive systems. Importantly, Meta AI’s introduction of ApolloBench provides a more streamlined and effective benchmark for evaluating video-LMMs, paving the way for future research.


Check out the Paper, Website, Demo, Code, and Models. All credit for this research goes to the researchers of this project.

