Nvidia Developer · September 3
NVIDIA NeMo Agent Toolkit Powers AI Application Deployment at Scale

 

This article details how NVIDIA used the NeMo Agent Toolkit to deploy and scale its AI-Q deep-research agent application. It walks through the full process, from single-user performance profiling and multi-user load testing to production monitoring and optimization. With the toolkit's evaluation and profiling system, developers can identify performance bottlenecks, such as LLM call latency, and target them for optimization. Load testing simulates concurrent users and collects data to forecast hardware requirements, ensuring the application can support large numbers of simultaneous users. The article also highlights the role of OTEL integration in monitoring and tracing, helping teams continuously observe and tune performance during rollout. Together, this methodology let NVIDIA deploy its AI agent to production with confidence.

🔍 **Profiling and bottleneck identification**: Using the NeMo Agent Toolkit's evaluation and profiling system, NVIDIA analyzed the single-user performance of the AI-Q deep-research agent and identified the NVIDIA Llama Nemotron Super 49B reasoning LLM as the main bottleneck. Visualizations such as Gantt charts let developers precisely trace where each part of the application spends time and resources, informing subsequent optimization.

📈 **Load testing and capacity planning**: After characterizing single-user performance, the team ran load tests with the NeMo Agent Toolkit Sizing Calculator, simulating different numbers of concurrent users. The collected data (such as p95 latency for LLM calls and for the overall workflow) was used to forecast the hardware needed to support hundreds of users, for example inferring that 100 concurrent users might require 10 GPUs.

🛠️ **Handling production deployment challenges**: Load testing surfaced issues such as CPU starvation in a NIM microservice and LLM call timeouts. Adjusting the Helm chart configuration, adding retries, and improving error handling resolved these potential production failures, ensuring stability and graceful degradation of the user experience.

📊 **Production monitoring and continuous optimization**: During the phased rollout, the team integrated the NeMo Agent Toolkit's OpenTelemetry (OTEL) Collector with Datadog to capture logs, performance data, and LLM traces. This enabled real-time performance monitoring, trace analysis of individual user sessions, and aggregated metrics across sessions for ongoing optimization in production.

💡 **Key tools and methodology**: The article underscores the central role of the NeMo Agent Toolkit throughout the scale-out, including evaluation and profiling, the Sizing Calculator, and OTEL integration. Combined with the AI Factory reference architecture, this toolset and methodology provide a solid foundation for building secure, scalable AI agent applications.

You’ve built a powerful AI agent and are ready to share it with your colleagues, but have one big fear: Will the agent work if 10, 100, or even 1,000 coworkers try to use it at the same time? Answering this critical question is a key part of bringing an AI agent to production. We recently faced this question as part of our internal deployment of a deep-research agent using the AI-Q NVIDIA Blueprint, an agentic application built using LangGraph.

This post will cover the tools and techniques from the NVIDIA NeMo Agent Toolkit we used to deploy and scale our agentic AI application into production.

How to build a secure, scalable deep-researcher

Deep-research applications are ubiquitous, with many individuals regularly using tools like Perplexity, ChatGPT, or Gemini. However, as at many organizations, using these deep-research tools with NVIDIA confidential information can be tricky. For this reason, NVIDIA released an open source blueprint earlier this year for building a deep-research application that can be deployed on premises. This blueprint was the starting point for our internal production deployment of a deep-research assistant.

Architecture

The AI-Q research agent allows users to upload documents and extract their metadata, access internal data sources, and search the web to create research reports. The blueprint is implemented using the NeMo Agent Toolkit and uses a variety of NVIDIA NeMo Retriever models for document ingest, retrieval, and large language model (LLM) invocations.

Our production deployment uses an internal OpenShift cluster following our AI factory reference architecture, with access to locally deployed NVIDIA NIM microservices and third-party observability tools. Our challenge was identifying what parts of the system needed to scale to support a rollout to hundreds of users across different NVIDIA teams.

Figure 1. AI-Q research agent blueprint architecture diagram

To address this challenge we followed a three-step process using tools from the NeMo Agent Toolkit at each phase:

1. Profile the application as a single user to identify bottlenecks.
2. Run a load test to collect data and estimate the architecture needed for hundreds of users.
3. Monitor the application during a phased rollout.

Step 1: How do you profile and optimize a single agentic application?

One challenge with bringing an agentic application to production is that every agentic application is different. It is difficult to create generic guidelines like “an AI application will need one GPU per 100 users.” Instead, the first step to scaling out an application is to deeply understand how the application works for one user. The NeMo Agent Toolkit offers an evaluation and profiling system to make it easy to gather data and come to a quantitative understanding of the application’s behavior.

To use the evaluation and profiling tool, simply add an evaluation section to your application’s config file. The eval config includes a dataset of sample user inputs for the application. Agentic applications are not deterministic, so it is useful to profile multiple inputs to understand how the application performs across the range of requests users might make.

```yaml
eval:
  general:
    output_dir: single_run_result
    dataset:
      _type: json
      file_path: example_inputs.json
    profiler:
      # Compute inter query token uniqueness
      token_uniqueness_forecast: true
      # Compute expected workflow runtime
      workflow_runtime_forecast: true
      # Compute inference optimization metrics
      compute_llm_metrics: true
      # Compute bottleneck metrics
      bottleneck_analysis:
        enable_nested_stack: true
      concurrency_spike_analysis:
        enable: true
        spike_threshold: 7
```


The AI-Q research agent is a LangGraph application that uses the NeMo Agent Toolkit function wrappers. These wrappers allow the profiler to automatically capture timing and token usage for different parts of the application. We can also track sub-steps within the application by adding simple decorators to the functions we care about.

```python
from aiq.profiler.decorators.function_tracking import track_function

@track_function(metadata={"source": "custom_function"})
def my_custom_function(a, b):
    return a + b
```

The eval command runs the workflow across the input dataset and collects and computes a variety of useful metrics.

```bash
aiq eval --config_file configs/eval_config.yml
```

One example of the available output is a Gantt (or waterfall) chart. The chart shows which functions are executing during each part of a user session. This information allowed us to identify which parts of our application were likely to become a bottleneck. For the AI-Q research agent, the main bottleneck was calls to the NVIDIA Llama Nemotron Super 49B reasoning LLM. Knowing the bottleneck allowed us to focus on replicating and scaling out the deployment of the NVIDIA NIM for that LLM.

Evaluating accuracy

In addition to capturing timing and token usage, the evaluation and profiling tool can compute evaluation metrics. In our case, it wasn’t enough to have an app that was fast and responsive for many users; it also needed to generate useful reports. We created custom metrics relevant to our deep-research use case and used the profiling and evaluation tool to benchmark different versions of the application code. This benchmarking ensured any optimizations we made did not reduce report quality. The toolkit reports metrics in a variety of formats, but a particularly useful option is exporting them to a platform like Weights & Biases to track and visualize experiments over time.

Figure 3. Comparison of metrics between two different feature branches
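As an illustration of the kind of custom metric described above, here is a minimal sketch. The function name `score_report` and the section-coverage heuristic are our own invention for illustration, not a NeMo Agent Toolkit API:

```python
# Hypothetical report-quality metric: the fraction of required sections
# that appear in a generated research report. Purely illustrative.
def score_report(report: str, required_sections: list[str]) -> float:
    """Return the fraction of required_sections found in the report text."""
    text = report.lower()
    hits = sum(1 for section in required_sections if section.lower() in text)
    return hits / len(required_sections)

# A report containing 2 of 3 required sections scores ~0.67.
print(score_report("## Summary\n...\n## Sources\n...",
                   ["summary", "sources", "methods"]))
```

A real metric would likely be richer (e.g., LLM-as-judge scoring or citation checks), but even a simple deterministic score like this is enough to compare feature branches consistently.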

Step 2: Can your architecture handle 200 users? Estimating your needs

After understanding and optimizing the application’s performance for one user, we were ready to take the next step: load testing across multiple users. The goals of the load test were to (a) run the application at higher concurrency, (b) fix anything that broke, and (c) collect data to inform the requirements for our final deployment.

To understand what architecture would support 200 concurrent users, we ran a load test of 10, 20, 30, 40, and 50 concurrent users with our available hardware. The data collected during the load test was then used to forecast the hardware needs for the full deployment.

To perform the load test we used the NeMo Agent Toolkit sizing calculator.

Capture concurrency data 

The toolkit sizing calculator works by using the same profiling and evaluation tool to run simulated workflows, but in parallel at different concurrency levels.

```bash
aiq sizing calc --calc_output_dir $CALC_OUTPUT_DIR --concurrencies 1,2,4,8,16,32 --num_passes 2
```

The calculator captures a variety of metrics during the load test, including p95 timing for each LLM invocation and p95 timing for the workflow as a whole. (Note: the output depicted below is from a toolkit example, not actual data from the internal deep-research agent load test.)

Figure 4. Timing data captured by the NeMo Agent Toolkit sizing calculator
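For context, a p95 latency is the value below which 95% of requests complete. As a minimal sketch (not the toolkit's internal implementation), the nearest-rank method computes it like this:

```python
import math

def p95(latencies: list[float]) -> float:
    """95th-percentile latency using the nearest-rank method."""
    ordered = sorted(latencies)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# With 100 samples of 1..100 ms, the p95 is the 95th-smallest value.
print(p95([float(i) for i in range(1, 101)]))  # -> 95.0
```

Tracking p95 rather than the mean matters here because agentic workflows have long-tailed latencies: a few slow LLM calls dominate the user experience even when the average looks healthy.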

Forecast for scale out

After capturing data at different concurrencies, we can understand how many users our existing architecture and hardware can support. For example, in the output below, assume we ran our load test on one GPU. The results tell us one GPU can support 10 concurrent users within our latency threshold. With that information, we can extrapolate that 100 concurrent users would need 10 GPUs.

Figure 5. Forecast of hardware needs from the NeMo Agent Toolkit sizing calculator

Other learnings

The other benefit of performing a load test is that it helps uncover bottlenecks or bugs in the application that may not be obvious from a single user run. In our initial load test of the AI-Q research agent, for example, we identified and corrected two bugs:

1. We monitored hardware metrics during the load test and found that one of the NVIDIA NIM microservices was using 100% of its allocated CPU. This finding helped us fix the root cause, which was a misconfiguration in our Helm chart that had deployed the NIM with fewer CPUs than intended.

Figure 6. CPU starvation during a stress test

2. We identified a number of places where the application would fail if the LLM call timed out. We were able to add retries and better error handling so that intermittent failures would not break the entire user experience, allowing for more graceful degradation.

```python
try:
    async with asyncio.timeout(ASYNC_TIMEOUT):
        async for chunk in chain.astream(input, stream_usage=True):
            answer_agg += chunk.content
            if "</think>" in chunk.content:
                stop = True
            if not stop:
                writer({"generating_questions": chunk.content})
except asyncio.TimeoutError:
    writer({"generating_questions": "Timeout error from reasoning LLM, please try again"})
    return {"queries": []}
```

Step 3: How to monitor, trace, and optimize your research agent’s performance as you scale up to production

With all this information in hand, we were able to deploy the AI-Q research agent with the appropriate number of replicas across various system components. As a final step, we scaled out using a phased approach, starting with small teams and gradually adding additional users. During the rollout, it was critical to observe application performance. We used the NeMo Agent Toolkit OpenTelemetry (OTEL) collector along with Datadog to capture logs, performance data, and LLM trace information.

```yaml
general:
  telemetry:
    tracing:
      otelcollector:
        _type: otelcollector
        # Your otel collector endpoint
        endpoint: http://0.0.0.0:4318/v1/traces
        project: your_project_name
```

The OTEL collector integration allows us to view specific traces for individual user sessions, helping us understand both application performance and LLM behavior.

Figure 7. Datadog flame graph showing timing for a real user session

We are also able to aggregate performance data across traces to understand how the application is performing. The following chart shows average latency and user sessions with outlying performance.

Figure 8. Datadog latency analysis showing p95 times and outliers for individually tracked functions

Conclusion

By using the NeMo Agent Toolkit in conjunction with a variety of AI factory reference partners, we were able to deploy an internal version of the AI-Q NVIDIA Blueprint and build a research agent with confidence.

Learn more about building with NeMo Agent Toolkit or try out the AI-Q NVIDIA research agent blueprint for yourself.
