Nvidia Developer · September 3
NVIDIA NeMo Agent Toolkit Powers AI Application Deployment at Scale

 

This article details how NVIDIA used the NeMo Agent Toolkit to deploy and scale its AI-Q deep-research agent application. It walks through the full process, from single-user performance profiling and multi-user load testing to production monitoring and optimization. With the toolkit's evaluation and profiling system, developers can identify performance bottlenecks, such as LLM call latency, and target them for optimization. Load testing simulates concurrent users and collects data to forecast hardware requirements, ensuring the application can support large numbers of simultaneous users. The article also highlights the role of OTEL integration in monitoring and tracing, helping teams continuously observe and tune performance during rollout. Together, this methodology let NVIDIA deploy its AI agent to production with confidence.

🔍 **Profiling and bottleneck identification**: Using the NeMo Agent Toolkit's evaluation and profiling system, NVIDIA analyzed the single-user performance of the AI-Q deep-research agent and identified the NVIDIA Llama Nemotron Super 49B reasoning LLM as the main bottleneck. Visualizations such as Gantt charts let developers precisely trace where each part of the application spends time and resources, informing subsequent optimization.

📈 **Load testing and capacity planning**: After characterizing single-user performance, the team ran load tests with the NeMo Agent Toolkit Sizing Calculator, simulating different numbers of concurrent users. The collected data (such as p95 latency for LLM calls and for the overall workflow) was used to forecast the hardware needed to support hundreds of users, for example inferring that 100 concurrent users might require 10 GPUs.

🛠️ **Handling production deployment challenges**: Load testing surfaced issues such as CPU starvation in a NIM microservice and LLM call timeouts. Adjusting the Helm chart configuration, adding retries, and improving error handling resolved these potential production failures, ensuring stability and graceful degradation of the user experience.

📊 **Production monitoring and continuous optimization**: During the phased rollout, the team integrated the NeMo Agent Toolkit's OpenTelemetry (OTEL) Collector with Datadog to capture logs, performance data, and LLM traces. This enabled real-time performance monitoring, trace analysis of individual user sessions, and aggregated metrics across sessions for ongoing optimization in production.

💡 **Key tools and methodology**: The article underscores the central role of the NeMo Agent Toolkit throughout the scale-out, including evaluation and profiling, the Sizing Calculator, and OTEL integration. Combined with the AI Factory reference architecture, this toolset and methodology provide a solid foundation for building secure, scalable AI agent applications.

You’ve built a powerful AI agent and are ready to share it with your colleagues, but have one big fear: Will the agent work if 10, 100, or even 1,000 coworkers try to use it at the same time? Answering this critical question is a key part of bringing an AI agent to production. We recently faced this question as part of our internal deployment of a deep-research agent using the AI-Q NVIDIA Blueprint, an agentic application built using LangGraph.

This post will cover the tools and techniques from the NVIDIA NeMo Agent Toolkit we used to deploy and scale our agentic AI application into production.

How to build a secure, scalable deep-researcher

Deep-research applications are ubiquitous, with many individuals regularly using tools like Perplexity, ChatGPT, or Gemini. However, as at many organizations, using these deep-research tools with NVIDIA confidential information can be tricky. For this reason, NVIDIA released an open source blueprint earlier this year for building a deep-research application that can be deployed on premises. This blueprint was the starting point for our internal production deployment of a deep-research assistant.

Architecture

The AI-Q research agent allows users to upload documents and extract their metadata, access internal data sources, and search the web to create research reports. The blueprint is implemented using the NeMo Agent Toolkit and uses a variety of NVIDIA NeMo Retriever models for document ingest, retrieval, and large language model (LLM) invocations.

Our production deployment uses an internal OpenShift cluster following our AI factory reference architecture, with access to locally deployed NVIDIA NIM microservices and third-party observability tools. Our challenge was identifying what parts of the system needed to scale to support a rollout to hundreds of users across different NVIDIA teams.

Figure 1. AI-Q research agent blueprint architecture diagram

To address this challenge we followed a three-step process using tools from the NeMo Agent Toolkit at each phase:

1. Profile the application as a single user to identify bottlenecks.
2. Run a load test to collect data and estimate the architecture needed for hundreds of users.
3. Monitor the application during a phased rollout.

Step 1: How do you profile and optimize a single agentic application?

One challenge with bringing an agentic application to production is that every agentic application is different. It is difficult to create generic guidelines like “an AI application will need one GPU per 100 users.” Instead, the first step to scaling out an application is to deeply understand how the application works for one user. The NeMo Agent Toolkit offers an evaluation and profiling system to make it easy to gather data and come to a quantitative understanding of the application’s behavior.

To use the evaluation and profiling tool, simply add an evaluation section to your application’s config file. The eval config includes a dataset of sample user inputs for the application. Agentic applications are not deterministic, so it is useful to profile multiple inputs to understand how the application performs across the range of requests users might make.

```yaml
eval:
  general:
    output_dir: single_run_result
    dataset:
      _type: json
      file_path: example_inputs.json
    profiler:
      # Compute inter query token uniqueness
      token_uniqueness_forecast: true
      # Compute expected workflow runtime
      workflow_runtime_forecast: true
      # Compute inference optimization metrics
      compute_llm_metrics: true
      # Compute bottleneck metrics
      bottleneck_analysis:
        enable_nested_stack: true
      concurrency_spike_analysis:
        enable: true
        spike_threshold: 7
```


The AI-Q research agent is a LangGraph application that uses the NeMo Agent Toolkit function wrappers. These wrappers allow the profiler to automatically capture timing and token usage for different parts of the application. We can also track sub-steps within the application by adding simple decorators to the functions we care about.

```python
from aiq.profiler.decorators.function_tracking import track_function

@track_function(metadata={"source": "custom_function"})
def my_custom_function(a, b):
    return a + b
```

The eval command runs the workflow across the input dataset and collects and computes a variety of useful metrics.

```bash
aiq eval --config_file configs/eval_config.yml
```

One example of the available output is a Gantt (or waterfall) chart. The chart shows which functions are executing during each part of a user session. This information allowed us to identify which parts of our application were likely to become a bottleneck. For the AI-Q research agent, the main bottleneck was calls to the NVIDIA Llama Nemotron Super 49B reasoning LLM. Knowing the bottleneck allowed us to focus on replicating and scaling out the deployment of the NVIDIA NIM for that LLM.

Evaluating accuracy

In addition to capturing timing and token usage, the evaluation and profiling tool can compute evaluation metrics. In our case, it wasn’t enough to have an app that was fast and responsive for many users; it also needed to generate useful reports. We created custom metrics relevant to our deep-research use case and used the profiling and evaluation tool to benchmark different versions of the application code. This benchmarking ensured any optimizations we made did not reduce report quality. The toolkit reports metrics in a variety of formats, but a particularly useful option is exporting them to a platform like Weights & Biases to track and visualize experiments over time.

Figure 3. Comparison of metrics between two different feature branches
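As an illustration of the kind of custom metric described above, here is a minimal sketch. The function name `score_report` and the section-coverage heuristic are our own invention for illustration, not a NeMo Agent Toolkit API:

```python
# Hypothetical report-quality metric: the fraction of required sections
# that appear in a generated research report. Purely illustrative.
def score_report(report: str, required_sections: list[str]) -> float:
    """Return the fraction of required_sections found in the report text."""
    text = report.lower()
    hits = sum(1 for section in required_sections if section.lower() in text)
    return hits / len(required_sections)

# A report containing 2 of 3 required sections scores ~0.67.
print(score_report("## Summary\n...\n## Sources\n...",
                   ["summary", "sources", "methods"]))
```

A real metric would likely be richer (e.g., LLM-as-judge scoring or citation checks), but even a simple deterministic score like this is enough to compare feature branches consistently.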

Step 2: Can your architecture handle 200 users? Estimating your needs

After understanding and optimizing the application’s performance for one user, we were ready to take the next step: load testing across multiple users. The goals of the load test were to (a) run the application at higher concurrency, (b) fix anything that broke, and (c) collect data to inform the requirements for our final deployment.

To understand what architecture would support 200 concurrent users, we ran a load test of 10, 20, 30, 40, and 50 concurrent users with our available hardware. The data collected during the load test was then used to forecast the hardware needs for the full deployment.

To perform the load test we used the NeMo Agent Toolkit sizing calculator.

Capture concurrency data 

The toolkit sizing calculator works by using the same profiling and evaluation tool to run simulated workflows, but in parallel at different concurrency levels.

```bash
aiq sizing calc --calc_output_dir $CALC_OUTPUT_DIR --concurrencies 1,2,4,8,16,32 --num_passes 2
```

The calculator captures a variety of metrics during the load test, including p95 timing for each LLM invocation and p95 timing for the workflow as a whole. (Note: the output depicted below is from a toolkit example, not actual data from the internal deep-research agent load test.)

Figure 4. Timing data captured by the NeMo Agent Toolkit sizing calculator
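For context, a p95 latency is the value below which 95% of requests complete. As a minimal sketch (not the toolkit's internal implementation), the nearest-rank method computes it like this:

```python
import math

def p95(latencies: list[float]) -> float:
    """95th-percentile latency using the nearest-rank method."""
    ordered = sorted(latencies)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# With 100 samples of 1..100 ms, the p95 is the 95th-smallest value.
print(p95([float(i) for i in range(1, 101)]))  # -> 95.0
```

Tracking p95 rather than the mean matters here because agentic workflows have long-tailed latencies: a few slow LLM calls dominate the user experience even when the average looks healthy.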

Forecast for scale out

After capturing data at different concurrencies, we can understand how many users our existing architecture and hardware can support. For example, in the output below, assume we ran our load test on one GPU. The results tell us one GPU can support 10 concurrent users within our latency threshold. With that information, we can extrapolate that 100 concurrent users would need 10 GPUs.

Figure 5. Forecast of hardware needs from the NeMo Agent Toolkit sizing calculator

Other learnings

The other benefit of performing a load test is that it helps uncover bottlenecks or bugs in the application that may not be obvious from a single user run. In our initial load test of the AI-Q research agent, for example, we identified and corrected two bugs:

1. We monitored hardware metrics during the load test and found that one of the NVIDIA NIM microservices was using 100% of its allocated CPU. This finding helped us fix the root cause, which was a misconfiguration in our Helm chart that had deployed the NIM with fewer CPUs than intended.

Figure 6. CPU starvation during a stress test

2. We identified a number of places where the application would fail if the LLM call timed out. We were able to add retries and better error handling so that intermittent failures would not break the entire user experience, allowing for more graceful degradation.

```python
try:
    async with asyncio.timeout(ASYNC_TIMEOUT):
        async for chunk in chain.astream(input, stream_usage=True):
            answer_agg += chunk.content
            if "</think>" in chunk.content:
                stop = True
            if not stop:
                writer({"generating_questions": chunk.content})
except asyncio.TimeoutError:
    writer({"generating_questions": "Timeout error from reasoning LLM, please try again"})
    return {"queries": []}
```

Step 3: How to monitor, trace, and optimize your research agent’s performance as you scale up to production

With all this information in hand, we were able to deploy the AI-Q research agent with the appropriate number of replicas across various system components. As a final step, we scaled out using a phased approach, starting with small teams and gradually adding additional users. During the rollout, it was critical to observe application performance. We used the NeMo Agent Toolkit OpenTelemetry (OTEL) collector along with Datadog to capture logs, performance data, and LLM trace information.

```yaml
general:
  telemetry:
    tracing:
      otelcollector:
        _type: otelcollector
        # Your otel collector endpoint
        endpoint: http://0.0.0.0:4318/v1/traces
        project: your_project_name
```

The OTEL collector integration allows us to view specific traces for individual user sessions, helping us understand both application performance and LLM behavior.

Figure 7. Datadog flame graph showing timing for a real user session

We are also able to aggregate performance data across traces to understand how the application is performing. The following chart shows average latency and user sessions with outlying performance.

Figure 8. Datadog latency analysis showing p95 times and outliers for individually tracked functions

Conclusion

By using the NeMo Agent Toolkit in conjunction with a variety of AI factory reference partners, we were able to deploy an internal version of the AI-Q NVIDIA Blueprint and build a research agent with confidence.

Learn more about building with NeMo Agent Toolkit or try out the AI-Q NVIDIA research agent blueprint for yourself.
