MarkTechPost@AI 2024年09月26日
Is Scaling the Only Path to AI Supremacy? This AI Paper Unveils ‘Phantom of Latent for Large Language and Vision Models’

Phantom is a family of large language and vision models (LLVMs) proposed by researchers at the Korea Advanced Institute of Science and Technology (KAIST) to address the trade-off between performance and computational efficiency in large models. The family improves learning capability without significantly increasing model size by introducing two techniques: the "Phantom Dimension" and "Phantom Optimization." Phantom performs strongly across multiple benchmarks, surpassing many larger models on tasks such as image understanding, chart interpretation, and mathematical reasoning, while keeping computational costs low.

🤔 **"Phantom Dimension" boosts learning capacity**: Phantom temporarily enlarges the latent hidden dimension during multi-head self-attention (MHSA), a technique called the "Phantom Dimension." This lets the model embed more vision-language knowledge without permanently increasing its size.

💪 **"Phantom Optimization" improves computational efficiency**: Phantom also introduces Phantom Optimization (PO), which combines autoregressive supervised fine-tuning (SFT) with direct preference optimization (DPO) techniques to reduce errors and ambiguity in outputs, significantly improving computational efficiency while maintaining high performance.

🏆 **Strong benchmark performance**: Phantom performs strongly across multiple benchmarks, surpassing many larger models such as Cambrian-1-13B and SPHINX-MoE-7B×8 on image understanding, chart interpretation, and mathematical reasoning tasks.

🚀 **Future potential**: Phantom's innovations offer a new way to balance performance and computational efficiency in large vision-language models, potentially extending AI models to a broader range of real-world scenarios such as augmented reality (AR) and mobile devices.

💡 **Model family**: The Phantom family includes models ranging from 0.5B to 7B parameters, covering the needs of different deployment scenarios.

Large language and vision models (LLVMs) face a critical challenge in balancing performance improvements with computational efficiency. As models grow in size, reaching up to 80B parameters, they deliver impressive results but require massive hardware resources for training and inference. This issue becomes even more pressing for real-time applications, such as augmented reality (AR), where deploying these large models on devices with limited resources, like mobile phones, is nearly impossible. Overcoming this challenge is essential for enabling LLVMs to function efficiently across various fields without the high computational costs traditionally associated with larger models.

Existing methods to improve the performance of LLVMs typically involve scaling up model size, curating larger datasets, and incorporating additional modules for enhanced vision-language understanding. While these approaches improve accuracy, they impose significant computational burdens, requiring high-end GPUs and substantial VRAM for training and inference. This makes them impractical for real-time applications and resource-limited environments. Additionally, integrating external vision modules adds complexity, further limiting their usability in on-device applications.

The researchers from KAIST propose the Phantom LLVM family, which includes models ranging from 0.5B to 7B parameters. Phantom enhances learning capabilities by temporarily increasing the latent hidden dimension during multi-head self-attention (MHSA), a feature termed “Phantom Dimension.” This innovation allows the model to embed significantly more vision-language knowledge without a permanent increase in model size. Phantom Optimization (PO) is also introduced, combining autoregressive supervised fine-tuning (SFT) with a direct preference optimization (DPO)-like approach to minimize errors and ambiguities in outputs. This approach significantly improves computational efficiency while maintaining high performance.
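The core idea of the Phantom Dimension can be illustrated with a minimal, single-head sketch: hidden states are projected into a wider latent space only for the attention computation, then projected back, so the model's resident hidden dimension never grows. All names and dimensions below are hypothetical; the paper's actual mechanism may differ in detail.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def phantom_mhsa(x, W_up, W_q, W_k, W_v, W_down):
    """Single-head sketch: temporarily widen the latent dimension
    for attention, then project back to the original size."""
    h = x @ W_up                         # (seq, d) -> (seq, d_phantom)
    q, k, v = h @ W_q, h @ W_k, h @ W_v  # attention in the widened space
    scores = q @ k.T / np.sqrt(k.shape[-1])
    out = softmax(scores) @ v
    return out @ W_down                  # back down: (seq, d_phantom) -> (seq, d)

d, d_phantom, seq = 8, 16, 4             # toy sizes for illustration
rng = np.random.default_rng(0)
x = rng.standard_normal((seq, d))
W_up = rng.standard_normal((d, d_phantom))
W_q = rng.standard_normal((d_phantom, d_phantom))
W_k = rng.standard_normal((d_phantom, d_phantom))
W_v = rng.standard_normal((d_phantom, d_phantom))
W_down = rng.standard_normal((d_phantom, d))

y = phantom_mhsa(x, W_up, W_q, W_k, W_v, W_down)
print(y.shape)  # (4, 8): output keeps the original hidden dimension
```

The point of the sketch is that the widened dimension exists only inside the attention block: the input and output shapes are unchanged, so downstream layers and the deployed model footprint stay the same.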

The Phantom models employ the InternViT-300M as a vision encoder, which aligns text-to-image representations through contrastive learning. The vision projector, constructed using two fully connected layers, adapts the hidden dimension to the corresponding multimodal LLM’s latent space. A core aspect of Phantom is the temporary enlargement of the latent hidden dimension during MHSA, which enhances the model’s ability to embed vision-language knowledge without increasing its physical size. The models are trained using a dataset of 2.8M visual instruction samples, curated into 2M Phantom triples (questions, correct answers, and incorrect or ambiguous answers). These triples play a crucial role in training through PO, improving response accuracy by eliminating confusion.
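To make the role of the Phantom triples concrete, here is a generic DPO-style preference loss over one (question, correct, ambiguous) triple. The triple contents, function name, and `beta` value are illustrative assumptions, not the paper's exact objective; the article only states that PO uses a DPO-like approach.

```python
import math

# Hypothetical Phantom triple: a question, the preferred (correct)
# answer, and a dispreferred (incorrect or ambiguous) answer.
triple = {
    "question": "What trend does the chart show for Q3?",
    "chosen": "Revenue rose 12% in Q3.",
    "rejected": "The chart shows some numbers going up.",
}

def dpo_like_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style objective: push the policy to prefer the correct
    answer over the ambiguous one, relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # -log(sigmoid(margin))

# Toy log-probabilities the policy and reference assign to each answer.
loss = dpo_like_loss(-2.0, -5.0, -2.5, -4.0)
```

Raising the policy's log-probability on the chosen answer (or lowering it on the rejected one) increases the margin and drives the loss toward zero, which is the mechanism by which training on such triples suppresses ambiguous responses.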

Phantom exhibits strong performance improvements across several benchmarks, outperforming many larger models in tasks involving image understanding, chart interpretation, and mathematical reasoning. For instance, in benchmarks like SQA-IMG and ChartQA, Phantom’s accuracy exceeds that of larger models such as Cambrian-1-13B and SPHINX-MoE-7B×8. These results demonstrate Phantom’s capability to handle complex vision-language tasks efficiently, all while using a smaller model size. This efficiency is largely due to Phantom Dimension and Phantom Optimization, which allow the model to maximize learning without a proportional increase in computational requirements.

The Phantom LLVM family introduces a new approach to addressing the challenge of balancing performance and computational efficiency in large vision-language models. Through the innovative use of Phantom Dimension and Phantom Optimization, Phantom enables smaller models to perform at the level of much larger models, reducing the computational burden and making these models feasible for deployment in resource-constrained environments. This innovation has the potential to expand the application of AI models across a broader range of real-world scenarios.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


