Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model

cs.AI updates on arXiv.org, July 9

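This article introduces llama-nemoretriever-colembed, a unified text-image retrieval model that achieves state-of-the-art performance on multiple benchmarks, and analyzes in detail the model's architectural changes and the resulting trade-offs in storage and efficiency.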

arXiv:2507.05513v1 Announce Type: cross

Abstract: Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model achieves state-of-the-art performance, scoring NDCG@5 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach leverages the NVIDIA Eagle2 Vision-Language model (VLM), modifies its architecture by replacing causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency. We provide a comprehensive analysis of these trade-offs. Additionally, we adopt a two-stage training strategy to enhance the model's retrieval capabilities.
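To make the "ColBERT-style late interaction" mentioned in the abstract more concrete, here is a minimal sketch of the MaxSim scoring rule that ColBERT-style retrievers generally use: each query token embedding is matched to its most similar document token embedding, and the per-token maxima are summed. This is an illustration, not the authors' implementation. The function names (`l2_normalize`, `maxsim_score`, `rank_documents`), the use of cosine similarity, and the toy 2-dimensional embeddings are all assumptions made for the example; a real system would take its token embeddings from the model.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Late-interaction score: for every query token, take the highest cosine
    similarity against any document token, then sum over query tokens.
    Assumes doc_tokens is non-empty."""
    q = [l2_normalize(t) for t in query_tokens]
    d = [l2_normalize(t) for t in doc_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q)


def rank_documents(
    query_tokens: Sequence[Vector], docs: Sequence[Sequence[Vector]]
) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs sorted from best to worst match."""
    scored = [(i, maxsim_score(query_tokens, doc)) for i, doc in enumerate(docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy embeddings: 2 query tokens, two candidate documents.
    query = [[1.0, 0.0], [0.0, 1.0]]
    doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
    doc_b = [[1.0, 0.0]]                          # covers only the first
    for index, score in rank_documents(query, [doc_a, doc_b]):
        print(index, round(score, 3))
```

Because every document keeps one embedding per token (or image patch) rather than a single pooled vector, index size grows with the number of tokens per document. That is one plausible reading of the storage and efficiency trade-off the abstract refers to.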

Related tags

Cross-modal retrieval, NVIDIA Eagle2 VLM, ColBERT-style interaction mechanism, multi-stage training strategy
