ByteByteGo · September 25, 18:01
RAG: Giving AI Access to Enterprise Private Knowledge

 

Large language models (LLMs) perform well in general scenarios, but in business settings they face real limitations: frozen knowledge, no access to private data, a tendency to hallucinate, and overly generic responses. Retrieval-Augmented Generation (RAG) addresses these problems by giving AI access to specific documents and data. A RAG system first retrieves relevant information from a document collection and then uses it to generate accurate, specific answers, much like handing the AI a reference library it can consult at any time. RAG fits scenarios where information changes frequently, private data is involved, accuracy requirements are high, or sources must be cited, and its two phases of document preparation and query processing enable deep integration of AI into enterprise applications.

💡 Limitations of LLMs: Large language models are powerful, but in enterprise applications they suffer from stale knowledge, no access to private data, a tendency to hallucinate, and generic responses, which limits their direct usability in business scenarios.

📚 The core idea of RAG: Retrieval-Augmented Generation combines information retrieval with text generation to give an LLM access to specific documents and datasets. It first retrieves relevant information and then uses it to generate more accurate, contextually grounded answers, compensating for gaps in the LLM's knowledge.

🎯 Where RAG applies: RAG is especially useful when information changes frequently (such as inventory or news), when internal private or proprietary data is involved (such as documents or customer records), when accuracy is critical (as in legal, medical, or financial work), or when answers must cite their sources.

⚙️ How a RAG system works: A RAG system has two main phases: document preparation (text extraction, chunking, embedding, and storage in a vector database) and query processing (query embedding, vector database retrieval, context assembly, LLM generation, and post-processing), which together deliver both efficiency and accuracy.

🔢 Why embeddings matter: Embeddings turn text into numerical vectors that capture semantic meaning, overcoming the limits of traditional keyword search. By computing distances between vectors, RAG quickly finds the most semantically relevant document chunks and retrieves them precisely.

✂️ Cut your QA cycles down to minutes with QA Wolf (Sponsored)

If slow QA processes bottleneck you or your software engineering team and you’re releasing slower because of it — you need to check out QA Wolf.

QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.

QA Wolf takes testing off your plate. They can get you:

The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.

With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.

Schedule a demo to learn more


Large Language Models like GPT-4 and Claude have transformed the way we interact with computers. However, despite their incredible utility in general scenarios, they face fundamental limitations that prevent them from being immediately useful in many business contexts.

Some of these limitations are as follows:

- Knowledge cutoff: the model only knows what existed in its training data, so recent information is missing.
- No access to private data: internal documents, policies, and customer records are invisible to it.
- Hallucinations: the model can produce confident-sounding but incorrect answers.
- Generic responses: without specific context, answers fall back to broad, general patterns.

Retrieval-Augmented Generation or RAG solves these problems by giving AI systems access to specific documents and data.

Instead of relying solely on what the model learned during training, RAG allows the system to look up relevant information from a particular document collection before generating a response. Think of it as providing the AI with a reference library it can consult whenever it needs to answer a question.


Help us Make ByteByteGo Newsletter Better

TL;DR: Take this 2-minute survey so I can learn more about who you are, what you do, and how I can improve ByteByteGo.

Take the ByteByteGo Survey


CodeRabbit: Free AI Code Reviews in CLI (Sponsored)

CodeRabbit CLI is an AI code review tool that runs directly in your terminal. It provides intelligent code analysis, catches issues early, and integrates seamlessly with AI coding agents like Claude Code, Codex CLI, Cursor CLI, and Gemini to ensure your code is production-ready before it ships.

Get Started Today


What is RAG?

At its core, Retrieval-Augmented Generation is a technique that combines two different processes into one system:

- Retrieval: searching a collection of documents for the information most relevant to a query.
- Generation: using a large language model to produce a response grounded in that retrieved information.

The name itself tells the story. We retrieve relevant documents first, then augment the generation process with that retrieved context.

Imagine walking into a library and asking the librarian a specific question about the local tax code. A regular librarian might share general knowledge about taxes, but a librarian with access to the city's specific tax documents could walk to the right shelf, pull out the relevant manual, read the specific section, and give an exact answer based on those official documents. This is what RAG does.

The difference between RAG and asking an LLM directly is significant. When we ask a standard LLM about a company's vacation policy, it might respond with generic information about typical vacation policies it learned during training. It might say something like "Many companies offer two to three weeks of paid vacation" because that's a common pattern it may have seen.

With RAG, the system first retrieves the actual employee handbook, finds the section about vacation policy, and then generates a response based on that specific document. The answer would be "According to the employee handbook, full-time employees receive 15 days of paid vacation in their first year, increasing to 20 days after three years of service."
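To make that difference concrete, here is a minimal sketch of how the retrieved handbook text changes the prompt sent to the model. The handbook snippet echoes the example above, and the generate() call is a hypothetical placeholder rather than any specific product's API:

```python
# Minimal sketch of prompt augmentation in RAG (illustrative only).

question = "How many vacation days do new full-time employees get?"

# Without RAG: the model sees only the question plus whatever it learned in training.
plain_prompt = question

# With RAG: a retrieval step has already pulled the relevant handbook section,
# and we prepend it so the answer is grounded in that specific document.
retrieved_chunk = (
    "Employee Handbook: Full-time employees receive 15 days of paid vacation "
    "in their first year, increasing to 20 days after three years of service."
)

augmented_prompt = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say so.\n\n"
    f"Context:\n{retrieved_chunk}\n\n"
    f"Question: {question}"
)

# generate() stands in for any LLM call (OpenAI, Anthropic, a local model, ...):
# answer = generate(augmented_prompt)
```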

See the diagram below for a high-level view of how RAG works:

Understanding when we need RAG and when a standard LLM is sufficient is an important factor in making good architectural decisions. Some of the cases when RAG is more useful are as follows:

- The information changes frequently, such as inventory levels, pricing, or news.
- The answers depend on private or proprietary data, such as internal documents, policies, or customer records.
- Accuracy requirements are strict, as in legal, medical, or financial contexts.
- Responses need to cite the source of the information they are based on.

On the other hand, we don't need RAG for general knowledge questions that the LLM can already handle quite well, like explaining common concepts, performing basic reasoning, or creative writing tasks.

How RAG Works - The Journey of a Query

This journey involves two distinct phases that happen at different times:

- Document preparation: an offline phase that extracts, chunks, embeds, and indexes the documents before any query arrives.
- Query processing: a real-time phase that runs for every user question, retrieving relevant chunks and generating an answer.

This two-phase approach is powerful because of the separation of concerns it provides between the computationally intensive document-preparation phase and the latency-sensitive query phase.

Let’s look at both phases in more detail:

1 - Preparation

The document preparation phase is like organizing a library before it opens. This foundational work happens before any user queries arrive and involves several crucial steps. See the diagram below:

The overall process works as follows:

- Text extraction: pull raw text out of the source files (PDFs, Word documents, wiki pages, and so on).
- Chunking: split the text into smaller passages so each piece covers a focused topic and fits the embedding model's input size.
- Embedding: convert each chunk into a numerical vector that captures its meaning.
- Storage: write the vectors, along with the original text and metadata, into a vector database for fast similarity search.
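To make these steps concrete, here is a minimal sketch of the preparation phase, assuming the sentence-transformers package for embeddings and a plain Python list as a stand-in for a vector database; the chunk size and model name are illustrative choices, not recommendations from the article:

```python
# Minimal sketch of the document-preparation phase (illustrative, not production code).
from sentence_transformers import SentenceTransformer  # assumes this package is installed

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # a small, commonly used embedding model

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split raw text into overlapping character windows so ideas are not cut in half."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# The "vector database" here is just a list of (vector, chunk, metadata) tuples.
vector_store: list[tuple] = []

def index_document(doc_id: str, raw_text: str) -> None:
    chunks = chunk_text(raw_text)
    vectors = embedder.encode(chunks)  # one embedding per chunk
    for i, (vec, chunk) in enumerate(zip(vectors, chunks)):
        vector_store.append((vec, chunk, {"doc": doc_id, "chunk": i}))
```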

2 - User Query Processing

When a user submits a query, the real-time query processing phase begins. This phase needs to be fast and efficient since users expect quick responses.

See the diagram below for a high-level view of the process:

Here’s how it works in more detail:

- Query embedding: the user's question is converted into a vector using the same embedding model used for the documents.
- Retrieval: the vector database returns the chunks whose embeddings are closest to the query vector.
- Context assembly: the retrieved chunks are combined with the original question into a single prompt.
- Generation: the LLM produces an answer grounded in that assembled context.
- Post-processing: the response can be formatted, checked, and annotated with citations before it is returned to the user.
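Continuing the sketch from the preparation phase above (it reuses the embedder and vector_store defined there), the query-time side might look like this; the generate() stub is a hypothetical placeholder for whichever LLM the system actually calls:

```python
# Minimal sketch of the query-processing phase, reusing embedder and vector_store
# from the preparation sketch above. Illustrative only.
import numpy as np

def cosine(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def generate(prompt: str) -> str:
    """Placeholder for a real LLM call (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError

def answer(question: str, top_k: int = 3) -> str:
    # 1. Embed the query with the same model used for the documents.
    q_vec = embedder.encode([question])[0]

    # 2. Retrieve the top-k chunks whose embeddings are closest to the query.
    scored = sorted(vector_store, key=lambda item: cosine(q_vec, item[0]), reverse=True)
    context = "\n\n".join(chunk for _, chunk, _ in scored[:top_k])

    # 3. Assemble the prompt, then 4. generate a grounded answer with the LLM.
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)  # post-processing (formatting, citations) would follow here
```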

Embeddings - The Heart of RAG

The fundamental challenge in information retrieval is that people express the same ideas in countless different ways. Traditional keyword search, which looks for exact word matches, fails to capture these variations.

For example, if the document says "The company permits product returns within 30 days", but a user searches for "How long can I send items back?", the keyword search finds nothing despite the obvious relationship between these texts.

Consider the variety of ways someone might ask about a computer problem: "laptop won't start," "computer fails to boot," "system not powering on," or "PC is dead." These phrases share almost no common words, yet they all describe the same issue. A keyword-based system would treat these as completely different queries and miss troubleshooting guides that use different terminology. This vocabulary mismatch problem has plagued information retrieval systems for decades.
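A tiny sketch makes the mismatch concrete; the documents and query below are illustrative examples, not data from the article:

```python
# Naive keyword matching fails when a query and a document describe the same
# problem with different words.

documents = [
    "Troubleshooting: if the system is not powering on, check the power adapter.",
    "How to reset your password from the login screen.",
]
query = "laptop won't start"

def keyword_match(query: str, doc: str) -> bool:
    """True if the query and document share at least one word."""
    return len(set(query.lower().split()) & set(doc.lower().split())) > 0

print([keyword_match(query, d) for d in documents])
# -> [False, False]: the relevant troubleshooting document is missed because
#    "laptop", "won't", and "start" never appear in it literally.
```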

Embeddings solve this by capturing semantic meaning rather than surface-level word matches.

Semantic meaning refers to what the text actually means, not just the specific words used. When text gets converted to embeddings, the resulting numbers represent the concepts and ideas in that text. Sentences about returning products end up with similar number patterns, whether they use words like "return," "refund," "send back," or "exchange."

The process of converting text to numbers might seem mysterious, but the principle is straightforward.

See the diagram below:

This numerical representation enables mathematical operations that would be impossible with raw text. Most importantly, we can calculate the distance between two embeddings to measure how similar their meanings are. For example, text about "laptop repairs" and "computer fixes" will have embeddings that are close together in this space, while "laptop repairs" and "cooking recipes" will be far apart. This distance calculation happens through simple mathematics, making it extremely fast even with millions of documents.
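The "simple mathematics" here is usually cosine similarity: the more two vectors point in the same direction, the closer the meanings of their texts. A small, self-contained sketch, again assuming the sentence-transformers package; the exact scores depend on the model, and only the ordering matters:

```python
# Cosine similarity between embeddings: similar meanings -> higher score.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed to be installed

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

texts = ["laptop repairs", "computer fixes", "cooking recipes"]
vecs = model.encode(texts)

print("laptop repairs vs computer fixes :", cosine(vecs[0], vecs[1]))
print("laptop repairs vs cooking recipes:", cosine(vecs[0], vecs[2]))
# The first pair should score noticeably higher than the second, which is
# exactly the signal the retrieval step relies on.
```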

The reason similar meanings create similar number patterns comes from how embedding models are trained.

During training, the model sees millions of examples of text and learns that certain words and phrases appear in similar contexts. For example, words like "doctor" and "physician" show up in similar sentences, get used interchangeably, and relate to the same concepts. The model learns to assign them similar numerical patterns. This learning happens automatically through exposure to vast amounts of text, without anyone explicitly programming these relationships.

What makes embeddings particularly fascinating is that we don't fully understand what each dimension represents. When an embedding model outputs 768 numbers for a piece of text, we can't simply say dimension 1 represents "formality" or dimension 547 represents "technical complexity." These dimensions emerge naturally during training as the model figures out what patterns it needs to track to understand language effectively. Some dimensions might loosely correlate with concepts we recognize, like sentiment or topic, but many capture abstract patterns that don't map to any concept we have words for.

It's important to understand that embedding models and large language models serve completely different purposes in an RAG system.

This specialization is why RAG systems use two separate models. The embedding model efficiently converts all the documents and queries into vectors, enabling fast similarity search. The LLM then takes the retrieved relevant documents and generates intelligent, contextual responses.

Building an RAG System

When building a retrieval-augmented generation (RAG) system, the first step is understanding the requirements clearly. As with most systems, everything begins with the users. Some questions that should be asked are as follows:

Next, we need to look closely at the document landscape. Scale matters: handling a hundred files is very different from handling hundreds of thousands, and different volumes demand different storage and retrieval strategies. The types of content involved (PDFs, Word docs, Confluence pages, or Notion databases) determine the ingestion and preprocessing pipelines. Equally important is understanding the query patterns by answering questions as follows:

The answers to these questions define how sophisticated the system must be. Once requirements are clear, we can move to the technology stack. Some of the most popular tools and technologies are as follows:
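As one illustration of how such tools fit together, here is a compact sketch assuming Chroma as the vector database (which can embed documents with its default embedding function) and an OpenAI-compatible chat model for generation; the collection name, model name, document text, and prompt wording are illustrative choices, not prescriptions from the original piece:

```python
# One plausible RAG wiring, for illustration only: Chroma for storage and retrieval,
# an OpenAI-compatible chat model for generation.
import chromadb
from openai import OpenAI

chroma = chromadb.Client()                       # in-memory vector database
collection = chroma.create_collection("handbook")
llm = OpenAI()                                   # reads OPENAI_API_KEY from the environment

# Preparation phase: Chroma embeds the chunk with its default embedding function.
collection.add(
    ids=["vacation-policy"],
    documents=["Full-time employees receive 15 days of paid vacation in their "
               "first year, increasing to 20 days after three years of service."],
)

# Query phase: retrieve the closest chunk, then generate a grounded answer.
question = "How many vacation days do new employees get?"
hits = collection.query(query_texts=[question], n_results=1)
context = "\n".join(hits["documents"][0])

response = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```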

Conclusion

Retrieval-Augmented Generation represents a practical solution to the very real limitations of LLMs in business applications. By combining the power of semantic search through embeddings with the generation capabilities of LLMs, RAG enables AI systems to provide accurate, specific answers based on the organization's own documents and data.

Understanding RAG's core concepts helps make informed decisions about whether it's right for a particular use case. If we need AI that can access private company information, provide current updates, cite sources, or maintain strict accuracy, RAG is likely the answer. The two-phase architecture of document preparation and query processing makes it scalable and efficient, while the use of embeddings ensures that users find relevant information regardless of how they phrase their questions.

The field of RAG continues evolving rapidly, with improvements in retrieval techniques, better embedding models, and more sophisticated generation strategies. However, the fundamental principles covered here remain constant.


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing sponsorship@bytebytego.com.
