NVIDIA Blog, October 10, 08:13
NVIDIA Blackwell Excels in InferenceMAX Benchmarks

The newly released SemiAnalysis InferenceMAX v1 benchmarks show the NVIDIA Blackwell platform leading in AI inference performance and energy efficiency. The benchmark is the first to measure total cost of compute across multiple models and real-world scenarios. The NVIDIA GB200 NVL72 system demonstrated an exceptional return on investment: a $5 million investment is projected to generate $75 million in token revenue, a 15x ROI. In addition, through software optimization, NVIDIA B200 brought the cost per million tokens on the gpt-oss model down to two cents, a 5x cost reduction in just two months. On throughput and interactivity, B200 with the NVIDIA TensorRT-LLM stack achieved 60,000 tokens per second per GPU and 1,000 tokens per second per user on the gpt-oss model.

🚀 **Outstanding performance and energy efficiency**: The NVIDIA Blackwell platform stands out in the SemiAnalysis InferenceMAX v1 benchmarks, delivering industry-leading AI inference performance and overall energy efficiency. The benchmark is the first independent measurement of total cost of compute across a range of models and real-world scenarios, highlighting Blackwell's advantages in both economics and performance.

💰 **Significant ROI and cost efficiency**: The NVIDIA GB200 NVL72 system demonstrates exceptional AI factory economics: a $5 million investment is projected to generate $75 million in token revenue, a 15x return on investment. Meanwhile, through continuous software optimization, NVIDIA B200 reached a cost of just two cents per million tokens on the gpt-oss model, a 5x cost reduction in two months that substantially lowers total cost of ownership.

⚡ **Leading throughput and interactivity**: With the latest NVIDIA TensorRT-LLM stack, NVIDIA B200 achieves 60,000 tokens per second per GPU of throughput and 1,000 tokens per second per user of interactivity on the gpt-oss model, providing strong support for large-scale AI applications.

💡 **Full-stack optimization drives value**: NVIDIA's full-stack approach, combining hardware and software codesign, optimizes every layer from model to infrastructure. Through collaborations with OpenAI, Meta and other leading open-source communities, NVIDIA ensures the latest models deliver maximum performance, accelerating AI adoption and innovation.

📈 **Transforming the economics of AI inference**: As AI shifts from simple question answering to complex reasoning, the demand for inference, and the economics behind it, is exploding. By delivering superior performance, energy efficiency and cost effectiveness, the Blackwell platform is reshaping the economic model of AI inference, letting enterprises convert AI's value into real returns more efficiently.

As AI shifts from one-shot answers to complex reasoning, the demand for inference — and the economics behind it — is exploding.

The new independent InferenceMAX v1 benchmarks are the first to measure total cost of compute across real-world scenarios. The results? The NVIDIA Blackwell platform swept the field — delivering unmatched performance and best overall efficiency for AI factories.

A $5 million investment in an NVIDIA GB200 NVL72 system can generate $75 million in token revenue. That’s a 15x return on investment (ROI) — the new economics of inference.
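The arithmetic behind that headline is simple to check. A minimal sketch in Python, using only the dollar figures quoted above:

```python
# ROI arithmetic using the figures quoted in the article.
investment_usd = 5_000_000       # GB200 NVL72 system investment
token_revenue_usd = 75_000_000   # projected token revenue

roi = token_revenue_usd / investment_usd
print(f"{roi:.0f}x return on investment")  # -> 15x return on investment
```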

“Inference is where AI delivers value every day,” said Ian Buck, vice president of hyperscale and high-performance computing at NVIDIA. “These results show that NVIDIA’s full-stack approach gives customers the performance and efficiency they need to deploy AI at scale.”

Enter InferenceMAX v1

InferenceMAX v1, a new benchmark from SemiAnalysis released Monday, is the latest to highlight Blackwell’s inference leadership. It runs popular models across leading platforms, measures performance for a wide range of use cases and publishes results anyone can verify.

Why do benchmarks like this matter?

Because modern AI isn’t just about raw speed — it’s about efficiency and economics at scale. As models shift from one-shot replies to multistep reasoning and tool use, they generate far more tokens per query, dramatically increasing compute demands.

NVIDIA’s open-source collaborations with OpenAI (gpt-oss 120B), Meta (Llama 3 70B), and DeepSeek AI (DeepSeek R1) highlight how community-driven models are advancing state-of-the-art reasoning and efficiency.

Partnering with these leading model builders and the open-source community, NVIDIA ensures the latest models are optimized for the world’s largest AI inference infrastructure. These efforts reflect a broader commitment to open ecosystems — where shared innovation accelerates progress for everyone.

Deep collaborations with the FlashInfer, SGLang and vLLM communities enable codeveloped kernel and runtime enhancements that power these models at scale.

Software Optimizations Deliver Continued Performance Gains

NVIDIA continuously improves performance through hardware and software codesign optimizations. Initial gpt-oss-120b performance on an NVIDIA DGX Blackwell B200 system with the NVIDIA TensorRT LLM library was market-leading, but NVIDIA’s teams and the community have significantly optimized TensorRT LLM for open-source large language models.

The TensorRT LLM v1.0 release is a major breakthrough in making large AI models faster and more responsive for everyone.

Through advanced parallelization techniques, it uses the B200 system and NVIDIA NVLink Switch’s 1,800 GB/s bidirectional bandwidth to dramatically improve the performance of the gpt-oss-120b model.
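As a rough illustration of why that interconnect bandwidth matters for parallelized inference, here is a back-of-envelope sketch; the per-layer traffic volume and layer count are hypothetical assumptions, not published figures:

```python
# Back-of-envelope interconnect time for a tensor-parallel forward pass.
# The bandwidth figure is from the article; everything else is assumed.
nvlink_gb_per_s = 1_800       # NVLink Switch bidirectional bandwidth (GB/s)
traffic_gb_per_layer = 0.5    # assumed cross-GPU traffic per layer (hypothetical)
num_layers = 36               # assumed transformer layer count (hypothetical)

transfer_s = num_layers * traffic_gb_per_layer / nvlink_gb_per_s
print(f"~{transfer_s * 1e3:.0f} ms of interconnect time per pass")  # ~10 ms
```

Keeping that interconnect term small relative to compute time is what lets aggressive parallelization raise throughput instead of stalling on communication.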

The innovation doesn’t stop there. The newly released gpt-oss-120b-Eagle3-v2 model introduces speculative decoding, a clever method that predicts multiple tokens at a time.

This reduces lag and delivers even quicker results, multiplying throughput fivefold at 100 tokens per second per user (TPS/user) and boosting per-GPU speeds from 6,000 to 30,000 tokens per second.
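For readers unfamiliar with the technique, here is a toy sketch of the draft-and-verify loop at the heart of speculative decoding. `draft_model` and `target_model` are hypothetical callables standing in for real model calls, not a TensorRT-LLM API:

```python
def speculative_decode(draft_model, target_model, prompt, k=4, max_new=64):
    """Greedy speculative decoding: draft k tokens cheaply, then verify.

    draft_model(tokens)  -> next token id (fast, approximate)
    target_model(tokens) -> next token id (slow, authoritative)
    Both are hypothetical stand-ins, not a real library API.
    """
    tokens = list(prompt)
    new = 0
    while new < max_new:
        # 1. The small draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))

        # 2. The target model checks the drafts. A real system scores all
        #    k positions in one batched forward pass; we loop for clarity.
        accepted = 0
        for i in range(k):
            expected = target_model(tokens + draft[:i])
            accepted = i + 1
            if draft[i] != expected:
                draft[i] = expected  # first mismatch: keep the target's token
                break

        # 3. Each iteration commits at least one verified token, up to k.
        tokens.extend(draft[:accepted])
        new += accepted
    return tokens
```

When the draft model agrees with the target most of the time, each expensive verification pass yields several tokens instead of one, which is where the throughput multiplier comes from.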

For dense AI models like Llama 3.3 70B, which demand significant computational resources due to their large parameter count and the fact that all parameters are utilized simultaneously during inference, NVIDIA Blackwell B200 sets a new performance standard in InferenceMAX v1 benchmarks.

Blackwell delivers over 10,000 TPS per GPU at 50 TPS per user interactivity — 4x higher per-GPU throughput compared with the NVIDIA H200 GPU.

Performance Efficiency Drives Value

Metrics like tokens per watt, cost per million tokens and TPS/user matter as much as throughput. In fact, for power-limited AI factories, Blackwell delivers 10x throughput per megawatt compared with the previous generation, which translates into higher token revenue.

The cost per token is crucial for evaluating AI model efficiency, directly impacting operational expenses. The NVIDIA Blackwell architecture lowered cost per million tokens by 15x versus the previous generation, leading to substantial savings and fostering wider AI deployment and innovation.
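A quick back-of-envelope calculation shows how throughput translates into cost per million tokens. The throughput figure is the B200 gpt-oss number cited earlier; the GPU-hour price is an illustrative assumption, not a quoted rate:

```python
# Back-of-envelope cost per million tokens.
gpu_hour_usd = 4.0            # assumed blended $/GPU-hour (hypothetical)
tokens_per_second = 60_000    # per-GPU gpt-oss throughput from the article

tokens_per_hour = tokens_per_second * 3_600
cost_per_million_usd = gpu_hour_usd * 1_000_000 / tokens_per_hour
print(f"${cost_per_million_usd:.3f} per million tokens")  # -> $0.019
```

At those rates the result lands near the two-cents-per-million-tokens figure the benchmarks report.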

Multidimensional Performance

InferenceMAX uses the Pareto frontier — a curve that shows the best trade-offs between different factors, such as data center throughput and responsiveness — to map performance.

But it’s more than a chart. It reflects how NVIDIA Blackwell balances the full spectrum of production priorities: cost, energy efficiency, throughput and responsiveness. That balance enables the highest ROI across real-world workloads.
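Conceptually, the frontier keeps only the configurations that no other configuration beats on both axes at once. A minimal sketch with made-up data points (InferenceMAX publishes the real measurements):

```python
# Extract the Pareto frontier from (throughput, interactivity) points.
# The data below are invented for illustration.
points = [  # (tokens/sec per GPU, tokens/sec per user)
    (60_000, 50), (45_000, 100), (30_000, 400), (25_000, 300), (10_000, 800),
]

def pareto_frontier(pts):
    """Keep points not dominated on both axes by any other point."""
    return sorted(
        p for p in pts
        if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in pts)
    )

print(pareto_frontier(points))
# -> [(10000, 800), (30000, 400), (45000, 100), (60000, 50)]
```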

Systems that optimize for just one mode or scenario may show peak performance in isolation, but those economics don’t scale. Blackwell’s full-stack design delivers efficiency and value where it matters most: in production.

For a deeper look at how these curves are built — and why they matter for total cost of ownership and service-level agreement planning — check out this technical deep dive for full charts and methodology.

What Makes It Possible?

Blackwell’s leadership comes from extreme hardware-software codesign. It’s a full-stack architecture built for speed, efficiency and scale.

The Bigger Picture

AI is moving from pilots to AI factories — infrastructure that manufactures intelligence by turning data into tokens and decisions in real time.

Open, frequently updated benchmarks help teams make informed platform choices and tune for cost per token, latency service-level agreements and utilization across changing workloads.

NVIDIA’s Think SMART framework helps enterprises navigate this shift, spotlighting how NVIDIA’s full-stack inference platform delivers real-world ROI — turning performance into profits.
