NVIDIA Blog · October 10, 00:37
Azure Launches Industry's First NVIDIA GB300 NVL72 Supercomputing Cluster to Power Frontier AI


🚀 **Industry-First Supercomputing Cluster:** Microsoft Azure has launched the new NDv6 GB300 VM series, delivering the industry's first supercomputing-scale production cluster purpose-built for OpenAI's most demanding AI inference workloads. At its core are more than 4,600 NVIDIA Blackwell Ultra GPUs, interconnected via the NVIDIA Quantum-X800 InfiniBand networking platform to provide unprecedented compute for large-scale AI models.

💡 **Inside the NVIDIA GB300 NVL72:** The heart of the cluster is the liquid-cooled, rack-scale NVIDIA GB300 NVL72 system. Each rack integrates 72 NVIDIA Blackwell Ultra GPUs and 36 NVIDIA Grace CPUs into a single cohesive unit that dramatically accelerates training and inference for massive AI models. The system delivers up to 37 TB of fast memory and 1.44 exaflops of FP4 Tensor Core performance per VM, creating a vast unified memory space for reasoning models, agentic AI systems, and complex multimodal generative AI.

🌐 **Advanced Network Fabric:** To unite more than 4,600 Blackwell Ultra GPUs into a single supercomputing system, the Azure cluster uses a two-tier NVIDIA networking architecture. Within each rack, the fifth-generation NVIDIA NVLink Switch fabric provides 130 TB/s of direct all-to-all bandwidth; across racks, the NVIDIA Quantum-X800 InfiniBand platform, with NVIDIA ConnectX-8 SuperNICs and Quantum-X800 switches, supplies 800 Gb/s of bandwidth per GPU for seamless GPU-to-GPU communication. Advanced adaptive routing, congestion control, and NVIDIA SHARP v4 further boost the efficiency of large-scale training and inference.

📈 **Record-Setting Performance:** The NVIDIA Blackwell Ultra platform excels at both training and inference. In the latest MLPerf Inference v5.1 benchmarks, NVIDIA GB300 NVL72 systems set performance records using NVFP4, including up to 5x higher per-GPU throughput on the 671-billion-parameter DeepSeek-R1 reasoning model compared with the NVIDIA Hopper architecture, along with leadership performance on new benchmarks such as the Llama 3.1 405B model.

Microsoft Azure today announced the new NDv6 GB300 VM series, delivering the industry’s first supercomputing-scale production cluster of NVIDIA GB300 NVL72 systems, purpose-built for OpenAI’s most demanding AI inference workloads.

This supercomputer-scale cluster features over 4,600 NVIDIA Blackwell Ultra GPUs connected via the NVIDIA Quantum-X800 InfiniBand networking platform. Microsoft’s unique systems approach applied radical engineering to memory and networking to provide the massive scale of compute required to achieve high inference and training throughput for reasoning models and agentic AI systems.

Today’s achievement is the result of years of deep partnership between NVIDIA and Microsoft in purpose-building AI infrastructure for the world’s most demanding AI workloads, delivering infrastructure for the next frontier of AI. It marks another leadership moment, ensuring that leading-edge AI drives innovation in the United States.

“Delivering the industry’s first at-scale NVIDIA GB300 NVL72 production cluster for frontier AI is an achievement that goes beyond powerful silicon — it reflects Microsoft Azure and NVIDIA’s shared commitment to optimize all parts of the modern AI data center,” said Nidhi Chappell, corporate vice president of Microsoft Azure AI Infrastructure.

“Our collaboration helps ensure customers like OpenAI can deploy next-generation infrastructure at unprecedented scale and speed.”

Inside the Engine: The NVIDIA GB300 NVL72

At the heart of Azure’s new NDv6 GB300 VM series is the liquid-cooled, rack-scale NVIDIA GB300 NVL72 system. Each rack is a powerhouse, integrating 72 NVIDIA Blackwell Ultra GPUs and 36 NVIDIA Grace CPUs into a single, cohesive unit to accelerate training and inference for massive AI models.

The system provides a staggering 37 terabytes of fast memory and 1.44 exaflops of FP4 Tensor Core performance per VM, creating a massive, unified memory space essential for reasoning models, agentic AI systems and complex multimodal generative AI.
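These rack-level figures can be sanity-checked with quick arithmetic. The per-device capacities below (288 GB of HBM3e per Blackwell Ultra GPU, 480 GB of LPDDR5X per Grace CPU) are assumptions drawn from public spec sheets, not figures from this announcement:

```python
# Back-of-envelope check of the GB300 NVL72 rack-level figures quoted above.
# Per-device capacities are assumed spec-sheet values, not from the article.
GPUS_PER_RACK = 72
CPUS_PER_RACK = 36
HBM_PER_GPU_GB = 288        # assumed HBM3e capacity per Blackwell Ultra GPU
LPDDR_PER_CPU_GB = 480      # assumed LPDDR5X capacity per Grace CPU

fast_memory_tb = (GPUS_PER_RACK * HBM_PER_GPU_GB
                  + CPUS_PER_RACK * LPDDR_PER_CPU_GB) / 1000
print(f"fast memory per rack ≈ {fast_memory_tb:.1f} TB")   # ≈ 38 TB,
# in the same ballpark as the quoted 37 TB (usable capacity is lower).

# 1.44 exaflops of FP4 Tensor Core performance spread across 72 GPUs:
per_gpu_pflops = 1.44 * 1000 / GPUS_PER_RACK
print(f"FP4 per GPU = {per_gpu_pflops:.0f} PFLOPS")        # 20 PFLOPS
```

The point of the exercise: the unified memory pool is the sum of GPU HBM and CPU-attached memory, which is what lets a single rack hold models far larger than any one GPU's HBM.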

NVIDIA Blackwell Ultra is supported by the full-stack NVIDIA AI platform, including collective communication libraries that tap into new formats like NVFP4 for breakthrough training performance, as well as compiler technologies like NVIDIA Dynamo for the highest inference performance in reasoning AI.
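NVFP4 is a block-scaled 4-bit format: each element is stored as an E2M1 value (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and small blocks of elements share a scale factor. The toy quantizer below is a conceptual sketch only, keeping the per-block scale in full precision for clarity; the real format stores scales in FP8 E4M3 with an additional tensor-level scale:

```python
import numpy as np

# Toy sketch of block-scaled 4-bit quantization in the spirit of NVFP4.
# Not NVIDIA's implementation: the shared scale is kept in full precision
# here, whereas NVFP4 stores it as FP8 E4M3 plus a tensor-level scale.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = np.concatenate([-E2M1_GRID[::-1], E2M1_GRID])  # signed values

def quantize_block(block: np.ndarray):
    """Quantize one block of 16 floats to E2M1 values plus a shared scale."""
    scale = float(np.abs(block).max() / 6.0) or 1.0  # map block max onto ±6
    codes = np.array([E2M1_GRID[np.abs(E2M1_GRID - x / scale).argmin()]
                      for x in block])
    return codes, scale

def dequantize_block(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes * scale

rng = np.random.default_rng(0)
block = rng.normal(size=16).astype(np.float32)
codes, scale = quantize_block(block)
error = np.abs(dequantize_block(codes, scale) - block).max()
print(f"max abs error in one block: {error:.3f}")
```

Because the scale adapts per block rather than per tensor, outliers in one block do not destroy precision everywhere else, which is what makes 4-bit training and inference viable.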

The NVIDIA Blackwell Ultra platform excels at both training and inference. In the recent MLPerf Inference v5.1 benchmarks, NVIDIA GB300 NVL72 systems delivered record-setting performance using NVFP4. Results included up to 5x higher throughput per GPU on the 671-billion-parameter DeepSeek-R1 reasoning model compared with the NVIDIA Hopper architecture, along with leadership performance on all newly introduced benchmarks like the Llama 3.1 405B model.

The Fabric of a Supercomputer: NVLink Switch and NVIDIA Quantum-X800 InfiniBand

To connect over 4,600 Blackwell Ultra GPUs into a single, cohesive supercomputer, Microsoft Azure’s cluster relies on a two-tiered NVIDIA networking architecture designed for both scale-up performance within the rack and scale-out performance across the entire cluster.

Within each GB300 NVL72 rack, the fifth-generation NVIDIA NVLink Switch fabric provides 130 TB/s of direct, all-to-all bandwidth between the 72 Blackwell Ultra GPUs. This transforms the entire rack into a single, unified accelerator with a shared memory pool — a critical design for massive, memory-intensive models.

To scale beyond the rack, the cluster uses the NVIDIA Quantum-X800 InfiniBand platform, purpose-built for trillion-parameter-scale AI. Featuring NVIDIA ConnectX-8 SuperNICs and Quantum-X800 switches, NVIDIA Quantum-X800 provides 800 Gb/s of bandwidth per GPU, ensuring seamless communication across all 4,608 GPUs.
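Taken together, the two tiers imply a simple topology, which a quick calculation using only the figures quoted above makes concrete:

```python
# Topology and bandwidth implied by the figures quoted in the article.
TOTAL_GPUS = 4608
GPUS_PER_RACK = 72
racks = TOTAL_GPUS // GPUS_PER_RACK
print(f"racks: {racks}")                    # 64 GB300 NVL72 racks

# Scale-up tier: 130 TB/s of all-to-all NVLink bandwidth shared by 72 GPUs.
nvlink_per_gpu_tbs = 130 / GPUS_PER_RACK
print(f"NVLink share per GPU ≈ {nvlink_per_gpu_tbs:.2f} TB/s")  # ≈ 1.81 TB/s

# Scale-out tier: 800 Gb/s of InfiniBand per GPU = 0.1 TB/s, roughly an
# order of magnitude below the NVLink share -- which is why rack-local
# traffic stays on NVLink and only cross-rack traffic uses InfiniBand.
ib_per_gpu_tbs = 800 / 8 / 1000
print(f"InfiniBand per GPU = {ib_per_gpu_tbs:.1f} TB/s")
```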

Microsoft Azure’s cluster also uses NVIDIA Quantum-X800’s advanced adaptive routing, telemetry-based congestion control and performance isolation capabilities, as well as NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v4, which accelerates operations to significantly boost the efficiency of large-scale training and inference.
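SHARP's key idea is moving reduction arithmetic into the switches, so an allreduce completes in one traversal of the switch tree instead of many endpoint-to-endpoint steps. The cost model below is a conceptual sketch, not an NCCL or SHARP implementation, comparing a classic ring allreduce with in-network aggregation:

```python
# Conceptual cost model (not NCCL/SHARP code): per-GPU network traffic and
# sequential communication steps for an allreduce of one gradient buffer.
def ring_allreduce(n_gpus: int, buffer_bytes: int):
    """Classic ring allreduce done entirely at the endpoints:
    each GPU sends 2*(n-1)/n of the buffer over 2*(n-1) sequential steps."""
    bytes_per_gpu = int(2 * (n_gpus - 1) / n_gpus * buffer_bytes)
    steps = 2 * (n_gpus - 1)
    return bytes_per_gpu, steps

def sharp_allreduce(buffer_bytes: int):
    """In-network aggregation: each GPU sends its buffer up the switch tree
    once and receives the fully reduced result once -- two steps total."""
    return 2 * buffer_bytes, 2

n, buf = 72, 1 << 30   # 72 GPUs, 1 GiB buffer (illustrative numbers)
r_bytes, r_steps = ring_allreduce(n, buf)
s_bytes, s_steps = sharp_allreduce(buf)
print(f"ring:  ~{r_bytes / 2**30:.2f} GiB per GPU, {r_steps} steps")
print(f"sharp:  {s_bytes / 2**30:.2f} GiB per GPU, {s_steps} steps")
```

The traffic per GPU is similar in both cases; the win from in-network reduction is the collapse in sequential steps (and hence latency), which compounds across the thousands of collectives issued per training iteration.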

Driving the Future of AI

Delivering the world’s first production NVIDIA GB300 NVL72 cluster at this scale required a reimagination of every layer of Microsoft’s data center — from custom liquid cooling and power distribution to a reengineered software stack for orchestration and storage.

This latest milestone marks a big step forward in building the infrastructure that will unlock the future of AI. As Azure scales to its goal of deploying hundreds of thousands of NVIDIA Blackwell Ultra GPUs, even more innovations are poised to emerge from customers like OpenAI.

Learn more about this announcement on the Microsoft Azure blog.
