Nvidia Developer · October 10, 07:43
NVIDIA Blackwell GPUs Demonstrate Leading Performance in the InferenceMAX v1 Benchmarks

The InferenceMAX™ v1 benchmarks published by SemiAnalysis show that NVIDIA Blackwell GPUs significantly outperform the previous-generation Hopper in inference performance. Through hardware-software co-design, including native support for the NVFP4 low-precision format, fifth-generation NVLink and NVLink Switch, and the TensorRT-LLM and Dynamo inference frameworks, Blackwell achieves up to a 15x performance gain. The open benchmark methodology allows the community to reproduce NVIDIA's leading results and to validate Blackwell's versatility and performance advantages across a wide range of AI inference scenarios. The benchmark data covers multiple dimensions, from latency and throughput to batch size, and spans a variety of models, precisions, and sequence lengths, providing a solid basis for deploying AI inference at scale.

🚀 **NVIDIA Blackwell performance leap:** InferenceMAX v1 results show that NVIDIA Blackwell GPUs (including the DGX B200 and GB200 NVL72 platforms) deliver a dramatic generational improvement in inference efficiency and cost-effectiveness over the previous-generation Hopper HGX H200. With fifth-generation Tensor Cores, native FP4 acceleration, 1,800 GB/s of NVLink bandwidth, and the latest HBM3e memory, the Blackwell architecture achieves an order-of-magnitude increase in compute per watt and memory bandwidth, significantly improving energy efficiency and substantially lowering the cost per million tokens.

💡 **Continuous software optimization drives performance gains:** Beyond ongoing hardware innovation, NVIDIA keeps improving performance at the software level. On the gpt-oss-120b model, for example, continuous optimization of the TensorRT-LLM stack delivered up to a 5x performance gain in just two months, raising maximum per-GPU throughput from roughly 6K at launch to 30K. In addition, parallelism configurations such as DEP (Data and Expert Parallelism), together with high-speed NVLink connectivity, allow the system to serve large-scale inference requests faster and at lower cost.

📊 **Comprehensiveness and openness of InferenceMAX v1:** InferenceMAX v1 uses a continuous, automated testing methodology, publishing benchmark results daily across multiple inference frameworks (SGLang, TensorRT-LLM, vLLM) and across both single-node and multi-node configurations. The benchmarks cover inference latency, throughput, batch size, and a variety of input/output ratios, and are run against reasoning, document processing, summarization, and chat scenarios, ensuring the results reflect real-world deployments. Its openness encourages the entire AI community to participate in validating and advancing performance.

💰 **Significant cost efficiency and revenue opportunity:** The benchmark data shows that the NVIDIA Blackwell GPU platform not only brings significant performance gains but also unlocks a substantial revenue opportunity. In the DeepSeek-R1 tests, for example, Blackwell GB200 NVL72 demonstrated a 15x performance advantage over Hopper H200, meaning the same investment can generate a higher return and considerably more token revenue. Software optimizations further reduce the cost per million tokens, making large-scale AI inference more economically viable.

SemiAnalysis recently launched InferenceMAX™ v1, a new open source initiative that provides a comprehensive methodology to evaluate inference hardware performance. Published results demonstrate that NVIDIA GPUs deliver the highest inference performance across all workloads.  

What does the data tell us? NVIDIA Blackwell demonstrated a 15x performance gain over the Hopper generation, unlocking a 15x revenue opportunity (Figure 1). This industry-leading performance and profitability are driven by extreme hardware-software co-design, including native support for NVFP4 low precision format, fifth-generation NVIDIA NVLink and NVLink Switch, and NVIDIA TensorRT-LLM and NVIDIA Dynamo inference frameworks.

With InferenceMAX v1 now open source, the AI community can reproduce NVIDIA’s industry-leading performance. We invite our customers, partners, and the wider ecosystem to use these recipes to validate the versatility and performance leadership of NVIDIA Blackwell across many AI inference scenarios.  

This independent third-party evaluation from SemiAnalysis provides yet another example of the world-class performance that the NVIDIA inference platform delivers for deploying AI at scale.

Figure 1. DeepSeek-R1 8K/1K results show a 15x performance benefit and revenue opportunity for NVIDIA Blackwell GB200 NVL72 over Hopper H200

Inside InferenceMAX v1

A key differentiator of InferenceMAX v1 is its continuous, automated testing. Continuous Integration (CI) results from benchmark sweeps are published each day, with tests run across multiple inference frameworks (SGLang, TensorRT-LLM, and vLLM) to capture performance improvements from the latest software releases.

The benchmarks cover both single-node and multi-node wide Expert Parallelism (EP) configurations, ensuring results reflect the diverse deployment scenarios used in production environments. Table 1 provides additional details on the models, precisions, input sequence lengths (ISL) and output sequence lengths (OSL) tested. Variable sequence lengths are used (80-100% of ISL/OSL combinations) to reflect the dynamic nature of real-world deployments.

| Model | Type | Parameters | Precisions | Chat (ISL/OSL) | Summarization (ISL/OSL) | Deep Reasoning (ISL/OSL) |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1 | MoE | 671B (37B active) | FP8, NVFP4 | 1K/1K | 8K/1K | 1K/8K |
| gpt-oss-120b | MoE | 117B (5.1B active) | FP8, MXFP4 | 1K/1K | 8K/1K | 1K/8K |
| Llama 3.3 70B | Dense | 70B | FP8, NVFP4 | 1K/1K | 8K/1K | 1K/8K |
Table 1. Types of models, precisions, and input and output sequence lengths covered in the InferenceMAX v1 benchmarks
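
To make the variable-sequence-length setup concrete, here is a minimal sketch of how such a sweep could be generated, assuming uniform sampling in the 80-100% band and the concurrency range (4-64) mentioned later in this post. The helper names and sampling scheme are illustrative assumptions, not the actual InferenceMAX harness.

```python
import random

# Nominal ISL/OSL pairs from Table 1 (chat, summarization, deep reasoning).
SCENARIOS = {"chat": (1024, 1024), "summarization": (8192, 1024), "reasoning": (1024, 8192)}

def sample_request_lengths(scenario: str, rng: random.Random) -> tuple[int, int]:
    """Sample an (ISL, OSL) pair at 80-100% of the nominal lengths,
    mimicking the variable sequence lengths used in the benchmark."""
    nominal_isl, nominal_osl = SCENARIOS[scenario]
    isl = int(nominal_isl * rng.uniform(0.8, 1.0))
    osl = int(nominal_osl * rng.uniform(0.8, 1.0))
    return isl, osl

def build_sweep(scenario: str, concurrencies=(4, 8, 16, 32, 64),
                requests_per_point: int = 128, seed: int = 0):
    """Build a list of benchmark points: one entry per concurrency level."""
    rng = random.Random(seed)
    sweep = []
    for concurrency in concurrencies:
        requests = [sample_request_lengths(scenario, rng) for _ in range(requests_per_point)]
        sweep.append({"scenario": scenario, "concurrency": concurrency, "requests": requests})
    return sweep

if __name__ == "__main__":
    points = build_sweep("summarization")
    print(points[0]["concurrency"], points[0]["requests"][:3])
```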

InferenceMAX v1 provides data across multiple dimensions including latency, throughput, batch sizes, and various input/output ratios covering reasoning tasks, document processing and summarization, and chat scenarios.

How did NVIDIA Blackwell perform in InferenceMAX v1?

The InferenceMAX v1 benchmark data clearly shows that the generational leap from NVIDIA Hopper HGX H200 to the NVIDIA Blackwell DGX B200 and NVIDIA GB200 NVL72 platforms brings dramatic gains in efficiency and cost-effectiveness. Blackwell features fifth-generation Tensor Cores with native FP4 acceleration and 1,800 GB/s of NVLink bandwidth, and uses the latest HBM3e memory. 

This leads to an order-of-magnitude increase in compute-per-watt and memory bandwidth, delivering both significantly better energy efficiency and dramatically lower cost per million tokens compared to Hopper. 

This post dives into the standout innovations behind these results and breaks down how the Blackwell architecture delivers such remarkable performance.

Continuous software optimizations deliver a boost in performance over time

Alongside the steady cadence of NVIDIA hardware innovation, NVIDIA also drives continuous performance gains through ongoing software optimizations. At the initial model launch of gpt-oss-120b, Blackwell B200 performance with TensorRT-LLM was solid, but left room for improvement as early per-GPU throughput numbers were substantially lower than today’s best. In a short time, NVIDIA engineering teams and the wider community have worked extensively to optimize the TensorRT-LLM stack for open source LLMs, unlocking even better performance (Figure 2). 

Figure 2. NVIDIA TensorRT-LLM sees 60,000 TPS/GPU max throughput, 1,000 TPS/user max interactivity, and 5x performance improvement in two months on gpt-oss-120b

The B200 InferenceMAX v1 configuration in Figure 2 shows the progress achieved since the launch of gpt-oss on August 5, leading to boosted throughput at all points of the Pareto frontier. At roughly 100 TPS/user, B200 achieves almost 2x better throughput on InferenceMAX v1 than at the model launch.
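
As a rough illustration of what "all points of the Pareto frontier" means here, the snippet below filters a set of (TPS/user, TPS/GPU) measurements down to the non-dominated configurations. The sample values are made up for illustration and are not taken from the published results.

```python
def pareto_frontier(points):
    """Keep only configurations that are not dominated on both axes:
    a point is dropped if another point has >= interactivity and >= throughput
    and is strictly better on at least one axis."""
    frontier = []
    for i, (u, t) in enumerate(points):
        dominated = any(
            (u2 >= u and t2 >= t) and (u2 > u or t2 > t)
            for j, (u2, t2) in enumerate(points) if j != i
        )
        if not dominated:
            frontier.append((u, t))
    # Sort by interactivity so the frontier can be plotted left to right.
    return sorted(frontier)

# Hypothetical (TPS/user, TPS/GPU) measurements from a sweep; values are made up.
measurements = [(50, 12000), (100, 9000), (100, 7000), (200, 4000), (400, 900)]
print(pareto_frontier(measurements))  # -> [(50, 12000), (100, 9000), (200, 4000), (400, 900)]
```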

Looking at October 9, the latest version of TensorRT-LLM introduces powerful new features such as EP and DEP (Data and Expert Parallelism) mappings, further increasing max throughput at 100 TPS/user by up to 5x compared to launch day, rising from roughly 6K to 30K max per-GPU throughput. One way this is achieved is by leveraging higher concurrencies compared to ones seen in the InferenceMAX v1 benchmark, as InferenceMAX currently only tests concurrencies 4-64. 

In addition, parallelism configurations like DEP achieve high throughput by distributing gpt-oss-120b Attention and MoE layers across multiple GPUs. This rapid, all-to-all communication is made possible by the 1,800 GB/s bidirectional bandwidth of NVLink and the NVLink Switch, which avoids traditional PCIe bottlenecks. The resulting high concurrency enables the system to serve many simultaneous inference requests at full speed, keeping the hardware fully utilized for all users (Figure 3).

Figure 3. The gpt-oss-120b Pareto frontier favors multi-GPU Tensor and Expert Parallelism greater than TP1, which rely on high-speed GPU-to-GPU NVLink

For instance, in the full DEP2 scheme, attention for each request is handled on one GPU (with its KV cache localized), while expert tokens for the MoE layers are dynamically routed and processed across two GPUs (64 experts per GPU). The NVLink Switch fabric ensures these expert tokens are distributed and aggregated with minimal delay, supporting immediate, direct exchanges between GPUs. 
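
The sketch below is a toy, single-process model of that DEP2 token flow, assuming 128 experts split 64 per GPU and simple top-1 routing. It only illustrates the bookkeeping of which tokens must cross the NVLink all-to-all; it is not the real TensorRT-LLM implementation.

```python
import random

NUM_GPUS = 2
EXPERTS_PER_GPU = 64            # 128 experts total, split across two GPUs
TOTAL_EXPERTS = NUM_GPUS * EXPERTS_PER_GPU

def route_tokens(num_tokens: int, seed: int = 0):
    """Decide, for each token, which GPU must process its expert computation.
    Attention (and the KV cache) stays on the token's home GPU."""
    rng = random.Random(seed)
    # outbound[src][dst] holds (token_id, expert_id) pairs src must send to dst.
    outbound = [[[] for _ in range(NUM_GPUS)] for _ in range(NUM_GPUS)]
    for token_id in range(num_tokens):
        home_gpu = token_id % NUM_GPUS                # attention/KV cache live here
        expert_id = rng.randrange(TOTAL_EXPERTS)      # top-1 routing for simplicity
        expert_gpu = expert_id // EXPERTS_PER_GPU     # which GPU hosts this expert
        outbound[home_gpu][expert_gpu].append((token_id, expert_id))
    return outbound

def all_to_all(outbound):
    """Simulate the NVLink all-to-all: every GPU receives the tokens destined for it."""
    inbound = [[] for _ in range(NUM_GPUS)]
    for src in range(NUM_GPUS):
        for dst in range(NUM_GPUS):
            inbound[dst].extend(outbound[src][dst])
    return inbound

if __name__ == "__main__":
    inbound = all_to_all(route_tokens(num_tokens=8))
    for gpu, work in enumerate(inbound):
        print(f"GPU {gpu} processes {len(work)} expert tokens: {work}")
```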

Another significant milestone is the enablement of speculative decoding for gpt-oss-120b using the newly released gpt-oss-120b-Eagle3-v2 model. With EAGLE-enabled speculation, per-GPU throughput at 100 TPS/user triples compared to published InferenceMAX v1 results, going from 10K to 30K tokens/second, making large-scale inference significantly more cost-efficient and responsive.
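
For intuition, here is a minimal, model-agnostic sketch of the draft-and-verify loop that speculative decoding relies on. The `draft_model` and `target_model` callables are placeholders standing in for an EAGLE-style draft head and the base model, and the greedy acceptance rule is a simplification of the actual scheme.

```python
def speculative_step(prefix, draft_model, target_model, num_draft_tokens=4):
    """One round of speculative decoding: draft several tokens cheaply,
    then accept the longest run the target model agrees with."""
    # 1. Draft model proposes tokens one after another.
    draft = []
    ctx = list(prefix)
    for _ in range(num_draft_tokens):
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. Target model verifies the draft (shown token by token for clarity;
    #    real implementations score prefix + draft in a single batched pass).
    accepted = []
    ctx = list(prefix)
    for tok in draft:
        target_tok = target_model(ctx)
        if target_tok != tok:
            accepted.append(target_tok)   # target's own token replaces the mismatch
            break
        accepted.append(tok)
        ctx.append(tok)
    else:
        accepted.append(target_model(ctx))  # bonus token when all drafts are accepted

    return prefix + accepted  # up to num_draft_tokens + 1 tokens per target pass
```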

In fact, accounting for these software improvements, in the two months since the model was released, the cost per million tokens at 100 TPS/user has reduced 5x, from $0.11 at launch to $0.02 today (Figure 4). For API service providers, this translates to greater revenue as model inference becomes both faster and less expensive to deliver at scale. Even at an ultra-high interactivity of 400 TPS/user, the cost per million tokens stays relatively low at $0.12, enabling the feasibility of more complex multi-agent use cases.
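
The cost figures follow directly from per-GPU throughput once an hourly GPU cost is fixed. The sketch below shows the arithmetic; the $2.40/GPU-hour figure is an assumption chosen only so that the example roughly reproduces the published $0.11 and $0.02 numbers, not a quoted price.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second_per_gpu: float) -> float:
    """$/M tokens = hourly GPU cost / tokens generated per hour * 1e6."""
    tokens_per_hour = tokens_per_second_per_gpu * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1e6

# Assumed hourly GPU cost for illustration only; with it fixed, a 5x throughput
# gain (roughly 6K -> 30K tokens/s per GPU at 100 TPS/user) cuts cost/M tokens 5x.
hourly_cost = 2.40
print(cost_per_million_tokens(hourly_cost, 6_000))   # ~$0.11 per million tokens
print(cost_per_million_tokens(hourly_cost, 30_000))  # ~$0.02 per million tokens
```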

These layered software enhancements, combined with open innovation, underscore NVIDIA’s commitment to pushing both hardware and software boundaries for generative AI at scale.

NVIDIA Blackwell powers high-efficiency Llama 3.3 70B inference with NVFP4

Blackwell B200 sets a new performance standard in InferenceMAX v1 benchmarks for dense AI models, such as Llama 3.3 70B, that demand significant computational resources due to their large parameter count and the fact that all parameters are utilized simultaneously during inference. Blackwell delivers 10,000 tokens per second at 50 TPS/user in the Llama 3.3 70B 1K/1K benchmark, more than 4x higher per-GPU throughput compared to Hopper H200 (Figure 5).

This demonstrates that Blackwell architectural innovations, such as NVFP4 support, deliver leadership in both dense and sparse workloads, enabling faster inference and more responsive experiences for users regardless of model complexity. 
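
To give a feel for what block-scaled 4-bit quantization does, the snippet below quantizes a block of values onto an E2M1-style grid with a shared per-block scale. It is a conceptual sketch only: real NVFP4 stores the per-block scale in FP8 alongside a tensor-level scale, and production Blackwell kernels work very differently.

```python
import numpy as np

# Positive magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(block: np.ndarray):
    """Quantize one block of 16 values to the E2M1 grid with a shared scale.
    Real NVFP4 encodes the scale in FP8; here it is kept as a plain float."""
    scale = float(np.abs(block).max()) / E2M1_LEVELS[-1] or 1.0
    scaled = block / scale
    # Snap each scaled value to the nearest representable E2M1 magnitude, keep sign.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_LEVELS[None, :]).argmin(axis=1)
    codes = np.sign(scaled) * E2M1_LEVELS[idx]
    return codes, scale

def dequantize_block_fp4(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = rng.normal(size=16).astype(np.float32)
    codes, scale = quantize_block_fp4(block)
    err = np.abs(block - dequantize_block_fp4(codes, scale)).mean()
    print(f"mean abs quantization error: {err:.4f}")
```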

By mapping performance and TCO across this frontier, InferenceMAX v1 shows that the NVIDIA Blackwell platform leads not just at one optimal point, but across the entire range of operational demands.

Figure 5. Blackwell B200 achieves up to 4x more throughput versus Hopper H200 on the Llama 3.3 70B 1K/1K benchmark

Blackwell GB200 NVL72 is the new standard in AI cost efficiency

The data from InferenceMAX v1 shows that GB200 NVL72 delivers significantly better total cost of ownership (TCO) compared to the prior generation H200 on the DeepSeek-R1 reasoning model (Figure 6).

Figure 6. Blackwell GB200 NVL72 demonstrates a clear TCO advantage compared to the previous Hopper generation

Across all measured interactivity levels, indicated as tokens per second per user, GB200 NVL72 consistently delivers a significantly lower cost per million tokens compared to H200. For example, at an interactivity of roughly 75 tokens per second, the H200 cost is $1.56 per million tokens, while GB200 NVL72 brings this down to just over $0.10 per million tokens, a striking 15x reduction in cost. The GB200 cost curve also remains substantially flatter for longer, which allows for serving past 100 TPS/user before costs noticeably increase.

For large-scale AI deployments, the implications of this performance are profound: AI factories leveraging GB200 NVL72 can serve far more users at better interactivity targets without incurring higher operational expenses or sacrificing throughput. 

Overall, as interactivity demands and the number of concurrent users grow, GB200 NVL72 maintains the lowest cost per million tokens among all compared architectures, making it the ideal solution for maximizing both user base and revenue at massive scale. 

Disaggregated serving and how GB200 NVL72, Dynamo, and TensorRT-LLM unlock the full performance of MoE models

Verified benchmarks from SemiAnalysis (Figures 1 and 6) show that the combination of GB200 NVL72, Dynamo, and TensorRT-LLM dramatically increases throughput of MoE models like DeepSeek-R1 under a wide range of SLA constraints, outperforming previous-generation Hopper-based systems.

The GB200 NVL72 scale-up design connects 72 GPUs through high-speed NVLink, forming a single, tightly integrated domain with up to 130 TB/s of bandwidth for GPU-to-GPU communication. This high-bandwidth, low-latency interconnect is critical for MoE models, enabling seamless communication between experts without the bottlenecks introduced by traditional internode links like InfiniBand.

In parallel, disaggregated inference in Dynamo introduces another layer of efficiency by separating the prefill and decode phases across different GB200 NVL72 nodes. This separation is critical as it enables each phase to be independently optimized with different GPU counts and configurations. The memory-bound decode phase can now leverage wide EP for expert execution without holding back the compute-heavy prefill phase.
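
Conceptually, a disaggregated deployment can be described as two independently sized pools plus a KV-cache handoff between them. The dictionary below is purely illustrative; it is not Dynamo's actual configuration schema, and all GPU counts and limits are assumptions.

```python
# Illustrative description of a disaggregated deployment: the compute-bound
# prefill phase and the memory-bound decode phase get separate GPU pools and
# parallelism settings, and finished prefills hand their KV cache to decode.
DISAGGREGATED_DEPLOYMENT = {
    "model": "DeepSeek-R1",
    "prefill_pool": {
        "num_gpus": 8,                 # assumed size for compute-heavy prompt processing
        "expert_parallel_degree": 8,
        "max_batch_tokens": 32_768,    # large batches of prompt tokens
    },
    "decode_pool": {
        "num_gpus": 64,                # sized independently for token generation
        "expert_parallel_degree": 64,  # wide EP: few experts per GPU
        "max_concurrent_sequences": 4_096,
    },
    "kv_transfer": "nvlink",           # KV-cache handoff path between the pools
}

def assign_request(request_phase: str) -> str:
    """Route a request to the pool that owns its current phase."""
    return "prefill_pool" if request_phase == "prefill" else "decode_pool"
```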

Finally, TensorRT-LLM mitigates the risk of GPU underutilization in EP. In large-scale wide EP deployments, it’s common for some GPUs to remain idle if they host experts that are rarely activated. This leads to inefficient use of compute resources. To address this, the wide EP implementation of TensorRT-LLM intelligently monitors expert load and distributes frequently used experts across different GPUs. It can also replicate popular experts to better balance workloads. This ensures efficient GPU usage and performance. 
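
A heavily simplified version of such load-aware placement is sketched below: the hottest experts are replicated across GPUs and the rest are assigned greedily to the least-loaded GPU. The expert names, counts, and replication threshold are illustrative; TensorRT-LLM's actual wide-EP balancer is more sophisticated.

```python
from collections import Counter

def place_experts(expert_load: Counter, num_gpus: int, replicate_top_k: int = 2):
    """Return a mapping expert -> list of GPUs it is placed on, based on
    observed token counts per expert."""
    # Replicate the most frequently activated experts across all GPUs.
    hot_experts = [e for e, _ in expert_load.most_common(replicate_top_k)]
    placement = {e: list(range(num_gpus)) for e in hot_experts}

    # Greedily assign the remaining experts to the least-loaded GPU.
    gpu_load = [0] * num_gpus
    for expert, load in expert_load.most_common():
        if expert in placement:
            continue
        target = min(range(num_gpus), key=lambda g: gpu_load[g])
        placement[expert] = [target]
        gpu_load[target] += load
    return placement

if __name__ == "__main__":
    load = Counter({"e0": 900, "e1": 850, "e2": 120, "e3": 90, "e4": 60, "e5": 40})
    print(place_experts(load, num_gpus=2))
```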

Together, GB200 NVL72, Dynamo, and TensorRT-LLM create an inference-optimized stack that unlocks the full potential of MoE models. 

NVIDIA partners with SGLang and vLLM to co-develop kernels and optimizations

Beyond advancements in the open source Dynamo and TensorRT-LLM frameworks, NVIDIA has partnered with the SGLang and vLLM open source projects to co-develop new Blackwell kernels and optimizations. These contributions, delivered through FlashInfer, include enhanced or newly introduced kernels for Attention Prefill & Decode, Communication, GEMM, MNNVL, MLA, and MoE. 

At the runtime level, further optimizations have been contributed to these LLM frameworks over the last few months. For SGLang, support for MTP (multi-token prediction) and disaggregation for the DeepSeek-R1 model were added. For vLLM, async scheduling with overlapped execution (to reduce host overhead and improve throughput) and automatic graph fusions were implemented. Additionally, performance and functionality improvements for gpt-oss, Llama 3.3, and general architectures have also been integrated into vLLM. 

Through advanced hardware, software optimizations, and open source collaboration, NVIDIA enables full performance and efficiency of Blackwell across popular open source inference frameworks.

Get started with NVIDIA Blackwell

The launch of the SemiAnalysis InferenceMAX v1 benchmarking suite introduces an open source and continuously updated framework for measuring inference performance. Through InferenceMAX v1, the NVIDIA Blackwell family has emerged as a clear leader, with B200 and GB200 NVL72 demonstrating up to 15x improvements in performance over the previous Hopper generation and driving up to a 15x revenue opportunity for AI factories. 

These results validate NVIDIA Blackwell architectural innovations, including NVFP4 precision, NVLink 5 interconnects, TensorRT-LLM, and Dynamo across a wide set of workloads and open source inference frameworks. As the NVIDIA platform continues to advance, ongoing software improvements drive even greater value.

Learn more and check out our latest NVIDIA performance data

To explore or reproduce the benchmarks, visit the SemiAnalysis InferenceMAX GitHub repo where the full set of containers and configurations are available.
