NVIDIA Developer · September 3
Rising AI Model Complexity and the Evolution of NVLink

As AI model complexity grows exponentially, parameter counts have jumped from millions to trillions, placing unprecedented demands on compute resources. Mixture-of-experts (MoE) architectures and test-time scaling further increase compute requirements. To deploy inference efficiently, AI systems are turning to large-scale parallelization strategies, including tensor, pipeline, and expert parallelism. This drives demand for larger GPU domains connected by a memory-semantic scale-up compute fabric that operates as a unified pool of compute and memory. This article details how NVIDIA NVLink, through NVLink Fusion, meets the performance and scale demands of complex AI models, supporting all-to-all communication among up to 72 GPUs at up to 1,800 GB/s. The evolution of NVLink Switch technology and the SHARP protocol optimizes bandwidth and latency. The NCCL library accelerates GPU-to-GPU communication and supports scale-up architectures. NVLink Fusion gives hyperscalers integration of custom silicon (CPUs and XPUs) with the NVLink scale-up fabric, supporting modular OCP MGX rack solutions with high performance and flexibility.

🔹 Rising AI model complexity has pushed parameter counts from millions to trillions, raising the bar for compute resources. Mixture-of-experts (MoE) architectures and test-time scaling further increase compute demand, driving the adoption of large-scale parallelization strategies.

🔸 NVIDIA NVLink optimizes bandwidth and latency through NVLink Switch and the SHARP protocol, supporting all-to-all communication among up to 72 GPUs at up to 1,800 GB/s to meet the performance needs of complex AI models.

📊 NCCL, an open-source communication library, accelerates GPU-to-GPU communication across single-node and multi-node topologies, achieves near-theoretical bandwidth, is integrated into all major deep learning frameworks, and benefits from a decade of development and production deployment.

🔗 NVLink Fusion lets hyperscalers integrate custom silicon (CPUs and XPUs) with the NVLink scale-up fabric, supports modular OCP MGX rack solutions for high performance and flexibility, and supports the Universal Chiplet Interconnect Express (UCIe) and NVLink-C2C interfaces.

🌐 NVLink Fusion is backed by a robust silicon ecosystem, including custom-silicon, CPU, and IP technology partners, plus a network of system partners and data center infrastructure component providers, ensuring rapid design-in capability and continuous technological advancement.

The exponential growth in AI model complexity has driven parameter counts from millions to trillions, demanding unprecedented computational resources that only clusters of GPUs can accommodate. The adoption of mixture-of-experts (MoE) architectures and AI reasoning with test-time scaling increases compute demands even further. To deploy inference efficiently, AI systems have evolved toward large-scale parallelization strategies, including tensor, pipeline, and expert parallelism. This is driving the need for larger domains of GPUs connected by a memory-semantic scale-up compute fabric that operates as a unified pool of compute and memory.
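To make these parallelism strategies concrete, here is a minimal sketch of tensor parallelism: a linear layer's weight matrix is split column-wise across a hypothetical two-way device group, each shard computes its slice of the output, and the slices are reassembled. The shapes and device count are illustrative assumptions, not details from the post.

```python
# Minimal tensor-parallelism sketch (illustrative shapes, 2-way split).
# Each "device" holds one column shard of the weight; in a real deployment
# the shards live on separate GPUs and the concat is a collective
# (all-gather) over the scale-up fabric.
import torch

hidden, out_features, world_size = 1024, 4096, 2
x = torch.randn(8, hidden)                  # one activation batch
w = torch.randn(hidden, out_features)       # full weight matrix

# Column-wise shard: device i owns columns [i*shard : (i+1)*shard).
shard = out_features // world_size
shards = [w[:, i * shard:(i + 1) * shard] for i in range(world_size)]

# Each device computes its partial output in parallel...
partials = [x @ w_i for w_i in shards]

# ...and the results are reassembled into the full layer output.
y = torch.cat(partials, dim=-1)
assert torch.allclose(y, x @ w, atol=1e-5)
```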

This blog post details how the performance and breadth of NVIDIA NVLink scale-up fabric technologies are made available through NVIDIA NVLink Fusion to address the growing demands of complex AI models.

Figure 1. Rising model size and complexity drive scale-up domain size 

NVIDIA first introduced NVLink in 2016 to overcome the limitations of PCIe in high-performance computing and AI workloads. It enabled faster GPU-to-GPU communication and created a unified memory space.

In 2018, the introduction of NVIDIA NVLink Switch technology achieved 300 GB/s all-to-all bandwidth between every GPU in an 8-GPU topology, paving the way for scale-up compute fabrics in the multi-GPU compute era. NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology, introduced with the third-generation NVLink Switch, delivers performance benefits such as optimized bandwidth for reduction operations and lower latency for collective operations.

With the fifth-generation NVLink released in 2024, NVLink Switch enhancements support all-to-all communication among 72 GPUs at 1,800 GB/s per GPU, giving 130 TB/s of aggregate bandwidth, 800x more than the first generation.
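The aggregate figure follows directly from the per-GPU number; a quick back-of-envelope check (our arithmetic, using only the figures quoted above):

```python
# Back-of-envelope check of the aggregate bandwidth quoted above.
gpus = 72
per_gpu_gb_s = 1800                       # GB/s all-to-all bandwidth per GPU
aggregate_tb_s = gpus * per_gpu_gb_s / 1000
print(f"{aggregate_tb_s:.1f} TB/s")       # 129.6 TB/s, rounded to 130 TB/s
```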

Despite NVLink being production-deployed at scale for nearly a decade, NVIDIA continues to push its limits, delivering the next three NVLink generations at an annual pace. This cadence provides continuous technological advancement that matches the exponential growth in AI model complexity and computational requirements.

NVLink performance relies on hardware and communication libraries—notably the NVIDIA Collective Communication Library (NCCL).

NCCL was developed as an open-source library to accelerate communication between GPUs in single-node and multi-node topologies, achieving near-theoretical bandwidth for GPU-to-GPU communication. It seamlessly supports scale-up and scale-out and includes automatic topology awareness and optimizations. NCCL is integrated into every major deep learning framework and benefits from a decade of development and production deployment.
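To illustrate how frameworks typically reach NCCL, here is a minimal all-reduce sketch using PyTorch's distributed package with the nccl backend; the launch command, single-node setting, and tensor size are assumptions for the example, not details from the post.

```python
# Minimal NCCL all-reduce through PyTorch's distributed package.
# Single-node launch assumed: torchrun --nproc_per_node=<num_gpus> this_script.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # NCCL handles the GPU collectives
    rank = dist.get_rank()
    torch.cuda.set_device(rank)              # rank == local rank on one node

    # Each rank contributes a tensor; NCCL sums them across all GPUs,
    # selecting ring/tree algorithms based on the detected topology.
    t = torch.ones(1024, device="cuda") * rank
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    if rank == 0:
        print(t[0].item())  # sum of ranks 0..world_size-1
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```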

Figure 3. NCCL across scale-up and scale-out, supported in all major frameworks

Maximizing AI factory revenue

NVIDIA's hardware and library experience with NVLink, combined with a large scale-up domain size, meets today's AI reasoning compute needs. The 72-GPU rack architecture plays a crucial role in this alignment by enabling optimal inference performance across use cases. When evaluating LLM inference performance, frontier Pareto curves show the balance between throughput per watt and latency.

The goal for AI factory productivity and revenue is to maximize the area under the curve. Many variables affect the curve dynamics, including raw compute, memory capacity, and throughput, along with scale-up technology that enables optimizations across tensor, pipeline, and expert parallelism with high-speed communication.
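As a concrete reading of the Pareto framing, the sketch below extracts a throughput-versus-latency frontier from hypothetical operating points; the data values and units are invented for illustration.

```python
# Sketch: extract the Pareto frontier from hypothetical (latency, throughput)
# operating points, where lower latency and higher throughput are both better.
points = [  # (latency in ms/token, throughput in tokens/s/MW) -- invented data
    (10, 200), (15, 340), (25, 480), (40, 520), (60, 530),
    (20, 300),  # dominated: (15, 340) is both faster and higher throughput
]

def pareto_frontier(pts):
    """Keep points not dominated by any other point."""
    frontier = []
    for lat, tput in sorted(pts):           # ascending latency
        if not frontier or tput > frontier[-1][1]:
            frontier.append((lat, tput))
    return frontier

print(pareto_frontier(points))
# [(10, 200), (15, 340), (25, 480), (40, 520), (60, 530)]
```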

Examining performance across various scale-up configurations reveals notable differences, even when NVLink speed remains constant.

- For NVLink in a 4-GPU mesh (with no switch), the curve suffers because each GPU's bandwidth is split among its peers (see the rough model after this list).
- An 8-GPU topology with NVLink Switch significantly boosts performance, as it achieves full bandwidth for every GPU-to-GPU connection.
- Increasing to a 72-GPU domain with NVLink Switch maximizes revenue and performance.
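To illustrate why the switchless mesh falls behind, here is a rough model of per-peer bandwidth (our simplification, reusing the 1,800 GB/s per-GPU figure from above):

```python
# Rough model of per-peer bandwidth: direct mesh vs. switched topology.
# The static even split among peers is a simplification for illustration.
per_gpu_bw = 1800  # GB/s total NVLink bandwidth per GPU (figure quoted above)

def peer_bandwidth_mesh(n_gpus, bw=per_gpu_bw):
    # Without a switch, a GPU's links are divided among its n-1 peers.
    return bw / (n_gpus - 1)

def peer_bandwidth_switched(bw=per_gpu_bw):
    # A switch lets any GPU pair use the full per-GPU NVLink bandwidth.
    return bw

print(f"4-GPU mesh: {peer_bandwidth_mesh(4):.0f} GB/s per peer")      # 600 GB/s
print(f"switched:   {peer_bandwidth_switched():.0f} GB/s per peer")   # 1800 GB/s
```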
Figure 4. NVLink scale-up fabric drives AI factory revenue

NVIDIA introduced NVLink Fusion to give hyperscalers access to all of the NVLink production-proven scale-up technologies. It enables custom silicon (CPUs and XPUs) to integrate with NVIDIA NVLink scale-up fabric technology and rack-scale architecture for semi-custom AI infrastructure deployment. 

The NVLink scale-up fabric technology access includes the NVLink SERDES, NVLink chiplets, NVLink Switches, and all aspects of the rack-scale architecture. The high-density rack-scale architecture includes the NVLink spine, copper cable system, mechanical innovations, advanced power and liquid cooling technology, and an ecosystem with supply chain readiness.

NVLink Fusion offers versatile solutions for custom CPU, custom XPU, or combined custom CPU and custom XPU configurations. Being available as a modular Open Compute Project (OCP) MGX rack solution enables NVLink Fusion integration with any NIC, DPU, or scale-out switch, giving customers the flexibility to build what they need. 

Figure 5. NVLink Fusion flexible infrastructure options for adopting NVLink scale-up fabric

For custom XPU configurations, the interface to NVLink uses Universal Chiplet Interconnect Express (UCIe) IP. NVIDIA provides a bridge chiplet from UCIe to NVLink for the highest performance and ease of integration, while giving adopters the same level of access to NVLink capabilities as NVIDIA. Because UCIe is an open standard, using this interface for NVLink integration gives customers the flexibility to choose other options for their XPU integration needs across current or future platforms.

Figure 6. NVLink Fusion with XPU access to NVLink through the NVLink chiplet

For custom CPU configurations, integration of NVIDIA NVLink-C2C IP for connectivity to NVIDIA GPUs is recommended for optimal performance. Systems with custom CPUs and NVIDIA GPUs gain access to hundreds of NVIDIA CUDA-X libraries as part of the CUDA platform, for advanced performance in accelerated computing. 
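As a small illustration of what that library access looks like in practice, the sketch below calls GPU routines through CuPy, which dispatches to CUDA-X libraries such as cuBLAS and cuFFT under the hood; CuPy itself and the array sizes are assumptions for the example, not details named in the post.

```python
# Illustration: CUDA-X libraries reached from Python via CuPy.
# The matmul below executes in cuBLAS and the FFT in cuFFT, on the GPU.
import cupy as cp

a = cp.random.rand(2048, 2048, dtype=cp.float32)
b = cp.random.rand(2048, 2048, dtype=cp.float32)

c = a @ b                     # dispatched to cuBLAS
spectrum = cp.fft.fft(a[0])   # dispatched to cuFFT

print(c.shape, spectrum.dtype)
```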

Figure 7. NVLink Fusion with custom CPU access to NVLink through NVLink-C2C

Supported by an extensive, production-ready partner ecosystem

NVLink Fusion is backed by a robust silicon ecosystem, including partners for custom silicon, CPUs, and IP technology. This ensures broad support and rapid design-in capabilities with continuous technological advancement.

For the rack offering, adopters benefit from our system partner network and data center infrastructure component providers that are already building the NVIDIA GB200 NVL72 and NVIDIA GB300 NVL72 systems in production volume. The combined ecosystem and supply chain enable adopters to accelerate their time to market, reducing bring-up time for the only rack-scale, scale-up fabric in production.

Greater performance for AI reasoning

NVLink represents a significant leap forward in addressing compute demand in the age of AI reasoning. By leveraging decade-long expertise in NVLink scale-up technologies, coupled with the open, production-deployed standards of the OCP MGX rack architecture and ecosystem, NVLink Fusion empowers hyperscalers with unparalleled performance and comprehensive customization options. 

Learn more about NVLink Fusion.
