MarkTechPost@AI · September 8
Meta REFRAG: An Efficiency Breakthrough for LLM Long-Context Processing

Meta Superintelligence Labs has introduced REFRAG (REpresentation For RAG), a framework designed to overhaul the efficiency of retrieval-augmented generation (RAG). REFRAG uses a lightweight encoder to compress retrieved text into fixed-size embeddings, sharply reducing the sequence length the model must process. This extends the LLM context window by 16× and accelerates generation by up to 30.85× without sacrificing accuracy. A reinforcement-learning policy selectively keeps the most critical information uncompressed, so the model maintains high precision even over much longer contexts. Experiments show REFRAG outperforming existing baselines across a range of long-context tasks.

💡 **REFRAG overhauls RAG efficiency**: REFRAG (REpresentation For RAG), proposed by Meta Superintelligence Labs, is a novel decoding framework that compresses retrieved text chunks into dense embedding representations, dramatically improving the efficiency of long-context processing. This lets the LLM handle contexts 16× longer than a standard model and delivers large generation speedups, up to a 30.85× acceleration in time-to-first-token (TTFT).

🚀 **Acceleration and KV-cache savings**: By sharply shortening the input sequence, REFRAG cuts the compute and memory cost of the model's quadratic attention mechanism. This translates directly into faster inference and a smaller key-value (KV) cache footprint. At k=16 and k=32, its TTFT speedups far exceed prior methods, and throughput improves by up to 6.78× over the LLaMA baseline.

🧠 **Reinforcement learning safeguards accuracy**: To preserve critical information during compression, REFRAG employs a reinforcement-learning policy that identifies the most information-dense chunks and lets them bypass compression, feeding them into the model as raw tokens. This ensures details such as numbers and proper nouns are passed through intact. Experiments show REFRAG maintains or improves accuracy while operating at much lower latency.

📊 **Experimental validation and performance gains**: REFRAG was pretrained on the SlimPajama corpus and evaluated on long-context datasets including Book, Arxiv, PG19, and ProofPile. Across RAG benchmarks, multi-turn dialogue, and long-document summarization, REFRAG performed strongly: it achieved the 16× context extension, improved accuracy over CEPE by roughly 9.3%, and showed an even larger advantage in settings with weaker retrievers.

Meta Superintelligence Labs has unveiled REFRAG (REpresentation For RAG), a decoding framework that rethinks retrieval-augmented generation (RAG) efficiency. REFRAG extends LLM context windows by 16× and achieves up to a 30.85× acceleration in time-to-first-token (TTFT) without compromising accuracy.

Why is long context such a bottleneck for LLMs?

The attention mechanism in large language models scales quadratically with input length. If a document is twice as long, the compute and memory cost can grow fourfold. This not only slows inference but also increases the size of the key-value (KV) cache, making large-context applications impractical in production systems. In RAG settings, most retrieved passages contribute little to the final answer, but the model still pays the full quadratic price to process them.
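For intuition, here is a quick back-of-the-envelope sketch (illustrative only; the sequence lengths and hidden size are assumptions, not figures from the paper) of how the quadratic attention term grows with retrieved context:

```python
def attention_cost(seq_len: int, d_model: int = 4096) -> int:
    """Rough FLOP proxy for one self-attention layer: O(seq_len^2 * d_model)."""
    return seq_len ** 2 * d_model

short = attention_cost(2_000)   # prompt plus a modest retrieved context
long = attention_cost(4_000)    # same prompt with twice the retrieved context

print(f"relative attention cost: {long / short:.1f}x")   # -> 4.0x for 2x the length
```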

How does REFRAG compress and shorten context?

REFRAG introduces a lightweight encoder that splits retrieved passages into fixed-size chunks (e.g., 16 tokens) and compresses each into a dense chunk embedding. Instead of feeding thousands of raw tokens, the decoder processes this shorter sequence of embeddings. The result is a 16× reduction in sequence length, with no change to the LLM architecture.
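A minimal sketch of the chunk-and-compress idea is shown below. The encoder here is a single hypothetical linear projection standing in for REFRAG's lightweight encoder; the chunk size of 16 tokens follows the example above, while the hidden size, module name, and tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

CHUNK_SIZE = 16    # tokens per chunk, following the example in the text
D_MODEL = 1024     # decoder hidden size (assumed for illustration)

class ChunkCompressor(nn.Module):
    """Hypothetical lightweight encoder: maps one chunk of token embeddings
    to a single dense chunk embedding (a stand-in for REFRAG's encoder)."""
    def __init__(self, d_model: int = D_MODEL):
        super().__init__()
        self.proj = nn.Linear(CHUNK_SIZE * d_model, d_model)

    def forward(self, chunk_token_embeds: torch.Tensor) -> torch.Tensor:
        # chunk_token_embeds: (num_chunks, CHUNK_SIZE, d_model)
        flat = chunk_token_embeds.flatten(start_dim=1)   # (num_chunks, CHUNK_SIZE * d_model)
        return self.proj(flat)                           # (num_chunks, d_model)

# 2048 retrieved tokens -> 128 chunk embeddings: a 16x shorter decoder input
raw_chunks = torch.randn(2048 // CHUNK_SIZE, CHUNK_SIZE, D_MODEL)
chunk_embeddings = ChunkCompressor()(raw_chunks)
print(chunk_embeddings.shape)   # torch.Size([128, 1024])
```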

(Figure from the REFRAG paper: https://arxiv.org/pdf/2509.01092)

How is acceleration achieved?

By shortening the decoder’s input sequence, REFRAG reduces the quadratic attention computation and shrinks the KV cache. Empirical results show 16.53× TTFT acceleration at k=16 and 30.85× acceleration at k=32, far surpassing prior state-of-the-art CEPE (which achieved only 2–8×). Throughput also improves by up to 6.78× compared to LLaMA baselines.
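The arithmetic behind these gains is straightforward (illustrative numbers, not measurements from the paper): the KV cache for the retrieved context shrinks linearly with its length, while the attention computed over it shrinks quadratically.

```python
context_tokens = 4096                      # raw retrieved tokens
compressed_len = context_tokens // 16      # 256 chunk embeddings after compression

kv_ratio = context_tokens / compressed_len             # KV cache scales linearly
attn_ratio = (context_tokens / compressed_len) ** 2    # attention scales quadratically

print(f"KV cache for the context:   ~{kv_ratio:.0f}x smaller")    # ~16x
print(f"Attention over the context: ~{attn_ratio:.0f}x cheaper")  # ~256x
```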

How does REFRAG preserve accuracy?

A reinforcement learning (RL) policy supervises compression. It identifies the most information-dense chunks and allows them to bypass compression, feeding raw tokens directly into the decoder. This selective strategy ensures that critical details—such as exact numbers or rare entities—are not lost. Across multiple benchmarks, REFRAG maintained or improved perplexity compared to CEPE while operating at far lower latency.
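The sketch below mimics this selective-expansion step with a simple top-k heuristic. The real policy is learned with reinforcement learning; here the per-chunk "information density" scores and the bypass fraction are placeholders for illustration.

```python
import torch

def select_bypass_chunks(chunk_scores: torch.Tensor, bypass_fraction: float = 0.25) -> torch.Tensor:
    """Toy stand-in for the learned policy: keep the highest-scoring chunks
    as raw tokens; all other chunks are fed to the decoder as compressed
    embeddings. Scores are assumed to reflect information density."""
    num_bypass = max(1, int(bypass_fraction * chunk_scores.numel()))
    return torch.topk(chunk_scores, k=num_bypass).indices

# Example: scores for 8 retrieved chunks (higher = more information-dense)
scores = torch.tensor([0.10, 0.90, 0.20, 0.05, 0.70, 0.30, 0.15, 0.60])
print(select_bypass_chunks(scores))   # tensor([1, 4]) -> these chunks stay uncompressed
```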

What do the experiments reveal?

REFRAG was pretrained on 20B tokens from the SlimPajama corpus (Books + arXiv) and tested on long-context datasets including Book, Arxiv, PG19, and ProofPile. On RAG benchmarks, multi-turn conversation tasks, and long-document summarization, REFRAG consistently outperformed strong baselines:

(Results table in the REFRAG paper: https://arxiv.org/pdf/2509.01092)

Summary

REFRAG shows that long-context LLMs don’t have to be slow or memory-hungry. By compressing retrieved passages into compact embeddings, selectively expanding only the important ones, and rethinking how RAG decoding works, Meta Superintelligence Labs has made it possible to process much larger inputs while running dramatically faster. This makes large-context applications—like analyzing entire reports, handling multi-turn conversations, or scaling enterprise RAG systems—not only feasible but efficient, without compromising accuracy.


FAQs

Q1. What is REFRAG?
REFRAG (REpresentation For RAG) is a decoding framework from Meta Superintelligence Labs that compresses retrieved passages into embeddings, enabling faster and longer-context inference in LLMs.

Q2. How much faster is REFRAG compared to existing methods?
REFRAG delivers up to 30.85× faster time-to-first-token (TTFT) and 6.78× throughput improvement compared to LLaMA baselines, while outperforming CEPE.

Q3. Does compression reduce accuracy?
No. A reinforcement learning policy ensures critical chunks remain uncompressed, preserving key details. Across benchmarks, REFRAG maintained or improved accuracy relative to prior methods.

Q4. Where will the code be available?
Meta Superintelligence Labs will release REFRAG on GitHub at facebookresearch/refrag.


