NVIDIA Developer · September 25, 10:12
Building an Agentic RAG Agent with NVIDIA Nemotron and LangGraph

This guide walks through a self-paced workshop on building an agentic RAG system with NVIDIA Nemotron models and LangGraph. Agentic RAG extends traditional RAG by integrating autonomous systems with LLMs, enabling dynamic decision-making and complex reasoning. The workshop covers the core principles of agentic RAG and how to build such a system with LangGraph, provides a portable development environment, and ultimately helps you create a shareable, customized agentic RAG system. It goes into model selection, tool use, data ingestion, text splitting, vector database construction, retriever tool creation, and agent configuration, culminating in an AI agent that handles IT help desk queries.

💡 **Core advantage of agentic RAG:** Traditional RAG is limited by training data, whereas agentic RAG integrates autonomous systems with LLMs to make dynamic decisions, adapt to change, and perform complex reasoning, overcoming these limits and delivering smarter, more flexible text generation. The agent not only generates text grounded in retrieved information, it also actively decides when to retrieve, how to retrieve, and how to use what it retrieves.

🛠️ **Tech stack and practice:** The workshop provides an end-to-end development flow that uses NVIDIA Nemotron as the core model and LangGraph to build the agentic RAG system. Concrete steps include configuring the development environment, setting up the required API keys (such as an NGC API Key), understanding RAG and the ReAct agent architecture, and hands-on implementation of data ingestion, text splitting, vector database construction (with FAISS), document retrieval and reranking, and finally creating a retriever tool and configuring the agent.

🚀 **Models and tools working together:** The workshop highlights the role of NVIDIA NIM services in agentic RAG, including the ChatNVIDIA model for LLM inference, the NVIDIA NeMo Retriever embedding model for document vectorization, and the NVIDIA NeMo Retriever reranking model for refining retrieval results. These models work with LangChain tooling (such as `create_retriever_tool`) and LangGraph's agent framework (such as `create_react_agent`) to build a capable IT help desk AI agent.

📊 **Detailed development flow:** Starting from loading Markdown files from a local IT knowledge base, documents are split with `RecursiveCharacterTextSplitter`, then stored in and retrieved from a FAISS vector database. A `ContextualCompressionRetriever` combines retrieval with a reranking model to keep the retrieved information accurate and relevant. Finally, with the ReAct agent architecture, the agent can intelligently decide whether to answer directly or call the retriever tool, delivering more precise IT support.

Unlike traditional LLM-based systems that are limited by their training data, retrieval-augmented generation (RAG) improves text generation by incorporating relevant external information. Agentic RAG goes a step further by leveraging autonomous systems integrated with LLMs and retrieval mechanisms. This allows these systems to make decisions, adapt to changing requirements, and perform complex reasoning tasks dynamically.

In this guide to the self-paced workshop for building a RAG agent, you’ll gain: 

    Understanding of the core principles of agentic RAG, including NVIDIA Nemotron, an open model family with open data and weights.
    Knowledge of how to build agentic RAG systems using LangGraph.
    A turnkey, portable development environment.
    Your own customized agentic RAG system, ready to share as an NVIDIA Launchable.

Video walkthrough

Video 1. Build a RAG Agent with NVIDIA Nemotron

Opening the workshop

Launch the workshop as an NVIDIA Launchable:

Figure 1. Click on the ‘Deploy Now’ button to deploy the NVIDIA DevX Workshop in the cloud

With your JupyterLab environment running, locate the NVIDIA DevX Learning Path section of the JupyterLab Launcher. Select the Agentic RAG tile to open the lab instructions and get started.

Setting up secrets

In order to follow along with this workshop, you’ll need to gather and configure a few project secrets.

    NGC API Key: This enables access to NVIDIA software, models, containers, and more.
    (optional) LangSmith API Key: This connects the workshop to LangChain's platform for tracing and debugging your AI agent.

You can use the Secrets Manager tile under the NVIDIA DevX Learning Path section of the JupyterLab Launcher to configure these secrets for your workshop development environment. Verify in the Logs tab that the secrets have been added successfully.

Figure 3. Use the “Secrets Manager” tile under the NVIDIA DevX Learning Path section to configure project secrets (API keys).

Introduction to RAG architecture

Once your workshop environment has been set up, the next step is understanding the architecture of the agentic RAG system you’ll build.

RAG enhances the capabilities of LLMs by incorporating relevant external information during output text generation. Traditional language models generate responses based solely on the knowledge captured in their training data, which can be a limiting factor, especially when dealing with rapidly changing information, highly specialized knowledge domains, or enterprise confidential data. RAG, on the other hand, is a powerful tool for generating responses based on relevant unstructured data retrieved from an external knowledge base.

Figure 4. Traditionally, RAG uses a user prompt to retrieve contextually relevant documents, providing them as context to the LLM for a more informed response.

The typical flow for a RAG system is:

    Prompt: A user generates a natural language query.
    Embedding Model: The prompt is converted into vectors.
    Vector Database Search: The embedded prompt is used to search a vector database of semantically indexed document chunks, enabling fast retrieval of contextually relevant data chunks.
    Reranking Model: The retrieved data chunks are reranked to prioritize the most relevant data.
    LLM: The LLM generates responses informed by the retrieved data.

This approach ensures that the language model can access up-to-date and specific information beyond its training data, making it more versatile and effective.

Understanding ReAct agent architecture

Unlike traditional LLM-based applications, agents can dynamically choose tools, incorporate complex reasoning, and adapt their analysis approach based on the situation at hand.

Figure 5. A ReAct agent can iteratively reason and call out to user-defined tools to generate a higher quality RAG-based response.

ReAct agents are a simple agentic architecture that uses "reasoning and acting" via LLMs that support tool calling. If the LLM requests any tool calls after processing the prompt, those tools are run, their results are added to the chat history, and the updated history is sent back to the model for another invocation.
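Conceptually, the loop looks something like the following minimal sketch. The search_kb tool and its canned response are purely hypothetical, for illustration only; the workshop builds the real agent with LangGraph's prebuilt helper later on.

from langchain_core.messages import HumanMessage, ToolMessage
from langchain_core.tools import tool
from langchain_nvidia_ai_endpoints import ChatNVIDIA

@tool
def search_kb(query: str) -> str:
    """Hypothetical knowledge-base search tool used only for this sketch."""
    return "VPN access requires the corporate client and an MFA token."

# Bind the tool to a tool-calling-capable model
llm_with_tools = ChatNVIDIA(model="nvidia/nvidia-nemotron-nano-9b-v2").bind_tools([search_kb])

messages = [HumanMessage(content="How do I connect to the VPN?")]
while True:
    ai_msg = llm_with_tools.invoke(messages)
    messages.append(ai_msg)
    if not ai_msg.tool_calls:  # no tool requested: the agent answers directly
        break
    for tool_call in ai_msg.tool_calls:  # run each requested tool and feed the result back
        result = search_kb.invoke(tool_call["args"])
        messages.append(ToolMessage(content=result, tool_call_id=tool_call["id"]))

print(messages[-1].content)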

RAG works well, but it's limited: the LLM can't determine how data is retrieved, control for data quality, or choose between data sources. Agentic RAG takes the concept a step further by combining the strengths of LLMs, such as language comprehension, contextual reasoning, and flexible generation, with dynamic tool usage and advanced retrieval mechanisms such as semantic search, hybrid retrieval, reranking, and data source selection. Making a ReAct agent for RAG just requires giving the agent the retrieval chain as a tool, so it can decide when and how to search for information.

Figure 6. The full agentic RAG pipeline will involve adding the ReAct agent to the Retrieval Chain where the contextual documents are stored.

Agentic RAG employs a ReAct agent architecture in which the reasoning LLM systematically decides whether to retrieve information via tool calling or respond directly, activating the retrieval pipeline only when additional context is needed to better address the user’s request.

Learn and implement the code

Now that we understand the concepts, let’s dive into the technical implementation. We’ll start with the foundational components before building up to the complete agentic RAG system:

    Models
    Tools
    Data Ingestion
    Text Splitting
    Vector Database Ingestion
    Document Retriever and Reranker
    Retriever Tool Creation
    Agent Configuration

Foundations: the models

The workshop relies on NVIDIA NIM endpoints for the core model powering the agent. NVIDIA NIM provides high-performance inference capabilities, including:

    Tool binding: Native support for function calling.
    Structured output: Built-in support for Pydantic models.
    Async operations: Full async/await support for concurrent processing.
    Enterprise reliability: Production-grade inference infrastructure.

This example shows the ChatNVIDIA LangChain connector using NVIDIA NIM.

from langchain_nvidia_ai_endpoints import ChatNVIDIA

LLM_MODEL = "nvidia/nvidia-nemotron-nano-9b-v2"
llm = ChatNVIDIA(model=LLM_MODEL, temperature=0.6, top_p=0.95, max_tokens=8192)
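Once configured, the connector behaves like any other LangChain chat model. A quick sanity check (the prompt is just an illustration) might look like:

response = llm.invoke("In one sentence, what does an IT help desk agent do?")
print(response.content)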

To ensure the quality of the LLM-based application, it's crucial that the agent receives clear instructions that guide its decision-making, remove ambiguity, and specify how it should treat retrieved documents. One such example from code/rag_agent.py follows:

SYSTEM_PROMPT = (
    "You are an IT help desk support agent.\n"
    "- Use the 'company_llc_it_knowledge_base' tool for questions likely covered by the internal IT knowledge base.\n"
    "- Always write grounded answers. If unsure, say you don't know.\n"
    "- Cite sources inline using [KB] for knowledge base snippets.\n"
    "- If the knowledge base doesn't contain sufficient information, clearly state what information is missing.\n"
    "- Keep answers brief, to the point, and conversational."
)

This prompt shows a few key principles of reliable LLM prompting for RAG-based applications:

    Role specification: Clear definition of the agent's expertise and responsibilities.
    Tool utilization: Instruct the agent on which tools to use for specific tasks.
    Grounding: Emphasize providing answers based on reliable sources and admitting uncertainty.
    Source citation: Provide guidelines for citing sources to ensure transparency.
    Communication style: Specify the desired communication style.

In code/rag_agent.py we define the models necessary for the IT Help Desk agent to answer user queries by utilizing the Knowledge Base. 

    The LLM model, Nemotron Nano 9B V2, is the primary reasoning model used for generating responses.
    The NVIDIA NeMo Retriever embedding model, Llama 3.2 EmbedQA 1B V2, converts documents into vector embedding representations for storage and retrieval.
    The NeMo Retriever reranking model, Llama 3.2 RerankQA 1B V2, reranks retrieved documents and data so that the most relevant results come first.

These models collectively enable the IT Help Desk agent to answer user queries accurately by leveraging a combination of language generation, document retrieval, and reranking capabilities.
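For reference, the three model identifiers might be wired up roughly as follows; the embedding and reranking catalog names here are assumptions based on the model families above, so check code/rag_agent.py for the values actually used.

LLM_MODEL = "nvidia/nvidia-nemotron-nano-9b-v2"                  # reasoning and response generation
RETRIEVER_EMBEDDING_MODEL = "nvidia/llama-3.2-nv-embedqa-1b-v2"  # assumed catalog name for the embedding model
RETRIEVER_RERANK_MODEL = "nvidia/llama-3.2-nv-rerankqa-1b-v2"    # assumed catalog name for the reranking model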

Our RAG agent will have access to the knowledge base provided at ./data/it-knowledge-base that contains markdown files documenting common IT-related procedures. The retriever tool enables the agent to search the internal IT knowledge base for documents relevant to the user’s query.

A vector database stores, indexes, and queries numerical representations of vectorized embeddings, allowing fast similarity searches over unstructured data like text, images, and audio. For our purposes, we use an in-memory FAISS database, which is efficient for spinning up small databases. For ingesting the knowledge base data, we'll focus on text; additional capabilities like multimodality should be considered for production use cases.

Foundations: data ingestion

The embedding model utilized is NeMo Retriever llama-3.2-nv-embedqa-1b-v2. This model creates embeddings for documents and queries that help in efficiently retrieving relevant documents from the knowledge base by comparing the semantic similarity between the query and the documents.
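As a quick illustration of what this model produces, you can embed a single query and inspect the resulting vector. The query string is arbitrary, and the model identifier is assumed to match the catalog name used above.

from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

demo_embeddings = NVIDIAEmbeddings(model="nvidia/llama-3.2-nv-embedqa-1b-v2", truncate="END")
vector = demo_embeddings.embed_query("How do I request a new laptop?")
print(len(vector))  # dimensionality of the embedding vector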

To ingest the documents, we’ll chunk the documents, embed those chunks into vectors, and then insert the vectors into the database. Before doing that, we need to load the data from our ./data/it-knowledge-base directory using the LangChain DirectoryLoader. 

from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Read the data
_LOGGER.info(f"Reading knowledge base data from {DATA_DIR}")
data_loader = DirectoryLoader(
    DATA_DIR,
    glob="**/*",
    loader_cls=TextLoader,
    show_progress=True,
)
docs = data_loader.load()

Foundations: text splitting

Document splitting is controlled by two things: chunk size and chunk overlap.

Chunk size defines the maximum length of each text chunk. This ensures that each chunk is of an optimized size for processing by language models and retrieval systems. A chunk size that is too large may contain information less relevant to specific queries, while one too small may miss important context.

Chunk overlap defines how much text (characters, for this splitter) is shared between consecutive chunks. The goal is to ensure continuity and preserve context across chunk boundaries, thereby maintaining coherence in the retrieved information.

To perform text splitting efficiently, we use the RecursiveCharacterTextSplitter. This tool recursively splits documents into smaller chunks based on character length, so each chunk adheres to the defined chunk size and overlap parameters. It’s particularly useful for processing large documents, improving the information retrieval’s overall accuracy.

from langchain.text_splitter import RecursiveCharacterTextSplitter

CHUNK_SIZE = 800
CHUNK_OVERLAP = 120

_LOGGER.info(f"Ingesting {len(docs)} documents into FAISS vector database.")
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
)
chunks = splitter.split_documents(docs)
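To see how overlap behaves, here is a small toy example; the sizes and sentence are chosen only for illustration and are not the workshop's actual settings.

demo_splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=15)
demo_chunks = demo_splitter.split_text(
    "Password resets require manager approval. Submit a ticket through the IT portal and wait for confirmation."
)
for chunk in demo_chunks:
    print(repr(chunk))  # consecutive chunks share up to 15 characters of text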

Foundations: vector database ingestion

To facilitate efficient retrieval of relevant information, we need to ingest our large corpus of documents into a vector database. Now that we have broken down our documents into manageable chunks, we utilize the embedding model to generate vector embeddings for each document chunk.

These embeddings are numerical representations of the semantic content of the chunks. High-quality embeddings enable efficient similarity searches, allowing the system to quickly identify and retrieve the most relevant chunks in response to a user’s query. 

The next step is to store the generated embeddings in an in-memory FAISS database, which ensures fast indexing and querying capabilities for real-time information retrieval. In this example, we leverage the fact that LangChain’s FAISS `from_documents` method conveniently generates the embeddings for the document chunks and also stores them in the FAISS vector store in one function call.

from langchain_community.vectorstores import FAISS
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embeddings = NVIDIAEmbeddings(model=RETRIEVER_EMBEDDING_MODEL, truncate="END")
vectordb = FAISS.from_documents(chunks, embeddings)
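As a quick sanity check, you can run a raw similarity search against the populated store; the query below is only an example.

hits = vectordb.similarity_search("How do I set up the corporate VPN?", k=2)
for doc in hits:
    print(doc.metadata.get("source"), "-", doc.page_content[:80])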

By following these steps and taking advantage of the power of the embedding model, we ensure that the IT Help Desk agent can efficiently retrieve and process relevant information from the knowledge base.

Foundations: document retriever and reranker

With our vector database populated, we can build a chain for content retrieval. This involves creating a seamless workflow that includes both the embedding step and the lookup step.

Figure 7. A basic retrieval chain consists of an embedding model and a database to store the converted vector embeddings.

In the embedding step, user queries are converted into embeddings using the same model that we previously used for document chunks. This ensures that both the queries and document chunks are represented in the same semantic space, enabling accurate similarity comparisons.

To initialize the retriever in this example, we'll use semantic similarity search and return the top six results most similar to our query.

# imports already handled
kb_retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={"k": 6})

The embeddings of the user’s queries are compared against the embeddings stored in the vector database during the lookup step. The system retrieves the most similar document chunks, which are then used to generate responses.
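In LangChain, that lookup is a single call on the retriever; for example, with an illustrative query:

docs = kb_retriever.invoke("My laptop won't connect to the office Wi-Fi")
print(len(docs))  # up to six chunks, per the k=6 setting above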

Figure 8. A more complex retrieval chain consists of attaching a Reranking model to reorganize retrieved context to put the most relevant chunks first.

For both the embedding and the reranking models, we’ll use NIM microservices from NVIDIA NeMo Retriever. LangChain allows us to easily create a basic retrieval chain from our Vector Database object that has both the embedding step and the lookup step.

To improve the relevance and ordering of retrieved documents, we can use the NVIDIARerank class, built on the NVIDIA NeMo Retriever reranking model. The reranker evaluates and ranks the retrieved document chunks by relevance to the user's query so that the most pertinent information is presented first. In this example, we initialize the reranker as follows:

from langchain_nvidia_ai_endpoints import NVIDIARerank

reranker = NVIDIARerank(model=RETRIEVER_RERANK_MODEL)

Taking the document retriever and the document reranker, we can now create the final document retriever as below:

from langchain.retrievers import ContextualCompressionRetriever

RETRIEVER = ContextualCompressionRetriever(
    base_retriever=kb_retriever,
    base_compressor=reranker,
)
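As a quick check of the combined chain (again, the query is just an example), the compression retriever can be called directly and returns reranked documents with the most relevant first:

reranked_docs = RETRIEVER.invoke("How do I reset my email password?")
for doc in reranked_docs:
    print(doc.page_content[:80])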

The LangChain ContextualCompressionRetriever makes it easy to combine a retriever with additional processing steps, attaching the retrieval chain to the reranking model. Now we can create the retriever tool that enables our ReAct Agent.

In this example, we can initialize the retriever tool by using the LangChain tools package below, passing in our initialized retriever:

from langchain.tools.retriever import create_retriever_tool

RETRIEVER_TOOL = create_retriever_tool(
    retriever=RETRIEVER,
    name="company_llc_it_knowledge_base",
    description=(
        "Search the internal IT knowledge base for Company LLC IT related questions and policies."
    ),
)
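The tool wraps retrieval behind a simple string interface; invoking it directly with a hypothetical query shows the concatenated snippets the agent will receive:

print(RETRIEVER_TOOL.invoke({"query": "How do I reset my password?"}))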

Foundations: agent configuration

With our vector database and retriever chain in place, we’re ready to construct the agent graph. This agent graph acts as a kind of flowchart, mapping out the possible steps the model can take to accomplish its task. In traditional, step-by-step LLM applications, these are called “chains.” When the workflow involves more dynamic, non-linear decision-making, we refer to them as “graphs.” The agent can choose different paths based on the context and requirements of the task at hand, branching out into different decision nodes.

Given the prevalence of the ReAct agent architecture, LangGraph provides a function that creates ReAct agent graphs. In this example, we use it as follows:

from langgraph.prebuilt import create_react_agent

AGENT = create_react_agent(
    model=llm,
    tools=[RETRIEVER_TOOL],
    prompt=SYSTEM_PROMPT,
)

By constructing an agent graph, we create a dynamic and flexible workflow that enables our IT Help Desk agent to handle complex decision-making processes. This approach ensures that the agent can efficiently retrieve and process information, provide accurate responses, and adapt to various scenarios.
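Before serving the graph, you can smoke-test it directly in Python; the question below is just an example.

from langchain_core.messages import HumanMessage

result = AGENT.invoke({"messages": [HumanMessage(content="How do I request VPN access?")]})
print(result["messages"][-1].content)  # the agent's final answer, grounded in retrieved snippets when needed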

Running your agent

Congratulations! You have successfully built your agent! Now, the next step is to try it out.

To get started with running your agent from your terminal, cd into the code directory that has the Python file containing your code for the agent. Once there, start your Agent API with the LangGraph CLI. Your agent will automatically reload as you make changes and save your code.

To chat with your agent, a simple Streamlit app has been included in the Simple Agents Client. You can also access the Streamlit Client from the Jupyter Launcher page. In the sidebar, ensure the rag_agent client is selected and try chatting!

Figure 9. Click on the “Simple Agents Client” tile in NVIDIA DevX Learning Path to open the Streamlit chat application.

As your agents become more sophisticated, managing their internal complexity can become difficult. Tracing helps visualize each step your agent takes, which makes it easier to debug and optimize your agent's behavior. In the workshop, you can optionally configure the LANGSMITH_API_KEY and view traces on the LangSmith dashboard.
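If you prefer to enable tracing in code rather than through the Secrets Manager, one common approach, assuming you have a LangSmith key, is to set LangSmith's standard environment variables before starting the agent. The project name below is hypothetical.

import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"              # enable LangSmith tracing
os.environ["LANGCHAIN_API_KEY"] = "<your LangSmith API key>"
os.environ["LANGCHAIN_PROJECT"] = "nemotron-rag-agent"   # hypothetical project name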

Migrate to local NIM microservices

This workshop uses the nvidia-nemotron-nano-9b-v2 LLM from the NVIDIA API Catalog. These APIs are useful for evaluating many models and for quick experimentation, and getting started is free. However, for the unlimited performance and control needed in production, deploy models locally with NVIDIA NIM microservice containers.

In a typical development workflow, both your agent and NIM containers would run in the background, allowing you to multitask and iterate quickly. For this exercise, we'll run the NIM in the foreground to easily monitor its output and confirm proper startup.

First, you need to log in to the NGC container registry as follows:

echo $NVIDIA_API_KEY | \
  docker login nvcr.io \
  --username '$oauthtoken' \
  --password-stdin

The next step is to create a location for NIM containers to save their downloaded model files.

docker volume create nim-cache

Now, we need to use a Docker run command to pull the NIM container image and model data files before hosting the model behind a local, OpenAI-compliant API.

docker run -it --rm \
    --name nemotron \
    --network workbench \
    --gpus 1 \
    --shm-size=16GB \
    -e NGC_API_KEY=$NVIDIA_API_KEY \
    -v nim-cache:/opt/nim/.cache \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/nvidia/nvidia-nemotron-nano-9b-v2:latest

After letting it run for a few minutes, you’ll know the NIM is ready for inference when it says Application startup complete.

INFO 2025-09-10 16:31:52.7 on.py:48] Waiting for application startup.
INFO 2025-09-10 16:31:52.239 on.py:62] Application startup complete.
INFO 2025-09-10 16:31:52.240 server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
...
INFO 2025-09-10 16:32:05.957 metrics.py:386] Avg prompt throughput: 0.2 tokens/s, Avg generation throughput: 1.1 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
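One quick way to confirm the endpoint is serving, assuming the mapped port is reachable from your environment as localhost:8000, is to list the available models through the OpenAI-compatible API:

import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
print(resp.json())  # should include nvidia/nvidia-nemotron-nano-9b-v2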

Now that your NIM is running locally, we need to update the agent you created in rag_agent.py to use it. 

llm = ChatNVIDIA(
    base_url="http://nemotron:8000/v1",
    model=LLM_MODEL,
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192,
)

With your LangGraph server still running, go back to the Simple Agents Client and try prompting the agent again. If everything was successful, you should notice no change!

Congratulations! You have now migrated to using Local NIM microservices for your LangGraph Agent! 

Conclusion and next steps

This workshop provides a comprehensive path from basic concepts to sophisticated agentic systems, emphasizing hands-on learning with production-grade tools and techniques. 

By completing this workshop, developers gain practical experience with:

    Fundamental concepts: Understanding the difference between standard and agentic RAG.
    State management: Implementing complex state transitions and persistence.
    Tool integration: Creating and managing agentic tool-calling capabilities.
    Modern AI stack: Working with LangGraph, NVIDIA NIM, and associated tooling.

Learn More

For hands-on learning, tips, and tricks, join our Nemotron Labs livestream, “Build a RAG Agent with NVIDIA Nemotron” on Tuesday, September 23, 2025 at 9am PT.

Stay up to date on Agentic AI, Nemotron, and more by subscribing to NVIDIA news, joining the community, and following NVIDIA AI on LinkedIn, Instagram, X, and Facebook.

Explore the self-paced video tutorials and livestreams here.
