AWS Machine Learning Blog, August 13
Train and deploy AI models at trillion-parameter scale with Amazon SageMaker HyperPod support for P6e-GB200 UltraServers

Amazon SageMaker HyperPod now supports NVIDIA GB200 UltraServers, bringing a step change in compute power for AI innovation. Each UltraServer integrates up to 72 NVIDIA Blackwell GPUs, delivering up to 360 petaflops of FP8 and 1.4 exaflops of FP4 compute. NVIDIA NVLink provides ultra-high-speed interconnects between GPUs and CPUs, greatly improving data transfer efficiency. The distributed training environment of SageMaker HyperPod takes full advantage of GB200 UltraServers, simplifying the development, training, and deployment of large-scale AI models, shortening iteration cycles, lowering operational costs, and helping organizations stay ahead in the AI race.

🚀 **Powerful compute and interconnect technology**: P6e-GB200 UltraServers pack up to 72 NVIDIA Blackwell GPUs, delivering 360 petaflops of FP8 and 1.4 exaflops of FP4 compute with 13.4 TB of HBM3e memory. The Grace Blackwell Superchip uses the NVLink-C2C interconnect to provide an order-of-magnitude increase in bandwidth between GPU and CPU, accelerating AI workloads.

🌐 **Exceptional network performance**: The UltraServers provide up to 130 TBps of low-latency NVLink bandwidth between GPUs, with fifth-generation NVLink delivering 1.8 TBps of direct GPU-to-GPU bandwidth. Each compute node can be configured with up to 17 NICs at 400 Gbps each, for a total of 28.8 Tbps of EFA v4 network bandwidth, keeping communication efficient for large-scale AI jobs.

💾 **Efficient storage and data throughput**: Up to 405 TB of local NVMe SSD storage supports large datasets and fast model checkpointing. Amazon FSx for Lustre file systems can be accessed over EFA with GPUDirect Storage (GDS), delivering TBps-level throughput and millions of IOPS to meet demanding AI training and inference needs.

🧠 **Optimized scheduling and resource management**: Amazon EC2 exposes instance topology information that helps training and inference algorithms optimize communication patterns and data placement. SageMaker HyperPod, with Amazon EKS or Slurm, automatically labels and configures nodes for topology-aware scheduling, maximizing job performance and resource utilization.

💡 **Broad use cases and deployment advantages**: P6e-GB200 UltraServers can efficiently train models with over a trillion parameters and deliver up to 30 times faster inference for very large models. Combined with frameworks such as NVIDIA Dynamo, they optimize long-context processing and support high-concurrency applications. Flexible training plans and resource pooling help startup, research, and enterprise customers accelerate AI innovation and adoption.

Imagine harnessing the power of 72 cutting-edge NVIDIA Blackwell GPUs in a single system for the next wave of AI innovation, unlocking 360 petaflops of dense 8-bit floating point (FP8) compute and 1.4 exaflops of sparse 4-bit floating point (FP4) compute. Today, that’s exactly what Amazon SageMaker HyperPod delivers with the launch of support for P6e-GB200 UltraServers. Accelerated by NVIDIA GB200 NVL72, P6e-GB200 UltraServers provide industry-leading GPU performance, network throughput, and memory for developing and deploying trillion-parameter AI models at scale. By seamlessly integrating these UltraServers with the distributed training environment of SageMaker HyperPod, organizations can rapidly scale model development, reduce downtime, and simplify the transition from training to large-scale deployment. With the automated, resilient, and highly scalable machine learning infrastructure of SageMaker HyperPod, organizations can seamlessly distribute massive AI workloads across thousands of accelerators and manage model development end-to-end with unprecedented efficiency. Using SageMaker HyperPod with P6e-GB200 UltraServers marks a pivotal shift towards faster, more resilient, and cost-effective training and deployment for state-of-the-art generative AI models.

In this post, we review the technical specifications of P6e-GB200 UltraServers, discuss their performance benefits, and highlight key use cases. We then walk through how to purchase UltraServer capacity through flexible training plans and get started using UltraServers with SageMaker HyperPod.

Inside the UltraServer

P6e-GB200 UltraServers are accelerated by NVIDIA GB200 NVL72, connecting 36 NVIDIA Grace CPUs and 72 Blackwell GPUs in the same NVIDIA NVLink domain. Each ml.p6e-gb200.36xlarge compute node within an UltraServer includes two NVIDIA GB200 Grace Blackwell Superchips, each connecting two high-performance NVIDIA Blackwell GPUs and an Arm-based NVIDIA Grace CPU with the NVIDIA NVLink chip-to-chip (C2C) interconnect. SageMaker HyperPod is launching P6e-GB200 UltraServers in two sizes. The ml.u-p6e-gb200x36 UltraServer includes a rack of 9 compute nodes fully connected with NVSwitch (NVS), providing a total of 36 Blackwell GPUs in the same NVLink domain, and the ml.u-p6e-gb200x72 UltraServer includes a rack-pair of 18 compute nodes with a total of 72 Blackwell GPUs in the same NVLink domain. The following diagram illustrates this configuration.

Performance benefits of UltraServers

In this section, we discuss some of the performance benefits of UltraServers.

GPU and compute power

With P6e-GB200 UltraServers, you can access up to 72 NVIDIA Blackwell GPUs within a single NVLink domain, with a total of 360 petaflops of FP8 compute (without sparsity), 1.4 exaflops of FP4 compute (with sparsity), and 13.4 TB of high-bandwidth memory (HBM3e). Each Grace Blackwell Superchip pairs two Blackwell GPUs with one Grace CPU through the NVLink-C2C interconnect, delivering 10 petaflops of dense FP8 compute, 40 petaflops of sparse FP4 compute, up to 372 GB HBM3e, and 850 GB of cache-coherent fast memory per module. This co-location boosts bandwidth between GPU and CPU by an order of magnitude compared to previous-generation instances. Each NVIDIA Blackwell GPU features a second-generation Transformer Engine and supports the latest AI precision microscaling (MX) data formats such as MXFP6 and MXFP4, as well as NVIDIA NVFP4. When combined with frameworks like NVIDIA Dynamo, NVIDIA TensorRT-LLM, and NVIDIA NeMo, these Transformer Engines significantly accelerate inference and training for large language models (LLMs) and Mixture-of-Experts (MoE) models, supporting higher efficiency and performance for modern AI workloads.

High-performance networking

P6e-GB200 UltraServers deliver up to 130 TBps of low-latency NVLink bandwidth between GPUs for efficient large-scale AI workload communication. At double the bandwidth of its predecessor, the fifth-generation NVIDIA NVLink provides up to 1.8 TBps of bidirectional, direct GPU-to-GPU interconnect, greatly enhancing intra-server communication. Each compute node within an UltraServer can be configured with up to 17 physical network interface cards (NICs), each supporting up to 400 Gbps of bandwidth. P6e-GB200 UltraServers provide up to 28.8 Tbps of total Elastic Fabric Adapter (EFA) v4 networking, using the Scalable Reliable Datagram (SRD) protocol to intelligently route network traffic across multiple paths, providing smooth operation even during congestion or hardware failures. For more information, refer to EFA configuration for P6e-GB200 instances.

Storage and data throughput

P6e-GB200 UltraServers support up to 405 TB of local NVMe SSD storage, ideal for large-scale datasets and fast checkpointing during AI model training. For high-performance shared storage, Amazon FSx for Lustre file systems can be accessed over EFA with GPUDirect Storage (GDS), providing direct data transfer between the file system and the GPU memory with TBps of throughput and millions of input/output operations per second (IOPS) for demanding AI training and inference.

Topology-aware scheduling

Amazon Elastic Compute Cloud (Amazon EC2) provides topology information that describes the physical and network relationships between instances in your cluster. For UltraServer compute nodes, Amazon EC2 exposes which instances belong to the same UltraServer, so your training and inference algorithms can understand NVLink connectivity patterns. This topology information helps optimize distributed training by allowing frameworks like the NVIDIA Collective Communications Library (NCCL) to make intelligent decisions about communication patterns and data placement. For more information, see How Amazon EC2 instance topology works.
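As a quick illustration, the following sketch calls the EC2 DescribeInstanceTopology API through boto3 to inspect how a set of compute nodes relate to each other; the instance IDs and Region are placeholders, and you should confirm the response fields you rely on against the current API reference.

```python
import boto3

# Minimal sketch: query EC2 instance topology for a set of HyperPod compute nodes.
# The instance IDs below are placeholders for your cluster's EC2 instances.
ec2 = boto3.client("ec2", region_name="us-east-1")

instance_ids = ["i-0123456789abcdef0", "i-0fedcba9876543210"]  # placeholders
response = ec2.describe_instance_topology(InstanceIds=instance_ids)

for instance in response["Instances"]:
    # NetworkNodes lists the network nodes from the outermost layer inward;
    # instances that share the deeper entries are physically closer together,
    # which is what topology-aware schedulers and NCCL can exploit.
    print(instance["InstanceId"], instance["AvailabilityZone"], instance["NetworkNodes"])
```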

With Amazon Elastic Kubernetes Service (Amazon EKS) orchestration, SageMaker HyperPod automatically labels UltraServer compute nodes with their respective AWS Region, Availability Zone, Network Node Layers (1–4), and UltraServer ID. These topology labels can be used with node affinities and pod topology spread constraints to assign Pods to cluster nodes for optimal performance.
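As a rough sketch of how you might work with these labels, the following Python snippet (using the official kubernetes client) groups worker nodes by an UltraServer ID label. The label key used here is an assumption for illustration only; inspect your nodes (for example with kubectl get nodes --show-labels) for the actual keys that SageMaker HyperPod applies.

```python
from collections import defaultdict
from kubernetes import client, config

# Assumed placeholder label key for illustration; check the real keys that
# SageMaker HyperPod applies for Region, AZ, network node layers, and UltraServer ID.
ULTRASERVER_LABEL = "sagemaker.amazonaws.com/ultraserver-id"  # assumption

config.load_kube_config()  # or config.load_incluster_config() when running in a pod
nodes_by_ultraserver = defaultdict(list)

# Group every node in the cluster by its UltraServer ID label (if present).
for node in client.CoreV1Api().list_node().items:
    labels = node.metadata.labels or {}
    ultraserver_id = labels.get(ULTRASERVER_LABEL, "unlabeled")
    nodes_by_ultraserver[ultraserver_id].append(node.metadata.name)

for ultraserver_id, node_names in nodes_by_ultraserver.items():
    print(f"{ultraserver_id}: {len(node_names)} nodes -> {node_names}")
```

You could then reference the same label key in a node affinity term or pod topology spread constraint so that the Pods of a tightly coupled job land on nodes within the same NVLink domain.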

With Slurm orchestration, SageMaker HyperPod automatically enables the topology plugin and creates a topology.conf file with the respective BlockName, Nodes, and BlockSizes to match your UltraServer capacity. This way, you can group and segment your compute nodes to optimize job performance.

Use cases for UltraServers

P6e-GB200 UltraServers can efficiently train models with over a trillion parameters due to their unified NVLink domain, ultrafast memory, and high cross-node bandwidth, making them ideal for state-of-the-art AI development. The substantial interconnect bandwidth makes sure even extremely large models can be partitioned and trained in a highly parallel and efficient manner without the performance setbacks seen in disjointed multi-node systems. This results in faster iteration cycles and higher-quality AI models, helping organizations push the boundaries of state-of-the-art AI research and innovation.

For real-time trillion-parameter model inference, P6e-GB200 UltraServers enable 30 times faster inference on frontier trillion-parameter LLMs compared to prior platforms, achieving real-time performance for complex models used in generative AI, natural language understanding, and conversational agents. When paired with NVIDIA Dynamo, P6e-GB200 UltraServers deliver significant performance gains, especially for long context lengths. NVIDIA Dynamo disaggregates the compute-heavy prefill phase and the memory-heavy decode phase onto different GPUs, supporting independent optimization and resource allocation within the large 72-GPU NVLink domain. This enables more efficient management of large context windows and high-concurrency applications.

P6e-GB200 UltraServers offer substantial benefits to startup, research, and enterprise customers with multiple teams that need to run diverse distributed training and inference workloads on shared infrastructure. When used in conjunction with SageMaker HyperPod task governance, UltraServers provide exceptional scalability and resource pooling, so different teams can launch simultaneous jobs without bottlenecks. Enterprises can maximize infrastructure utilization, reduce overall costs, and accelerate project timelines, all while supporting the complex needs of teams developing and serving advanced AI models, including massive LLMs for high-concurrency real-time inference, across a single, resilient platform.

Flexible training plans for UltraServer capacity

SageMaker AI currently offers P6e-GB200 UltraServer capacity through flexible training plans in the Dallas AWS Local Zone (us-east-1-dfw-2a). UltraServers can be used for both SageMaker HyperPod and SageMaker training jobs.

To get started, navigate to the SageMaker AI training plans console, which includes a new UltraServer compute type, from which you can select your UltraServer type: ml.u-p6e-gb200x36 (containing 9 ml.p6e-gb200.36xlarge compute nodes) or ml.u-p6e-gb200x72 (containing 18 ml.p6e-gb200.36xlarge compute nodes).
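If you prefer the API over the console, the boto3 sketch below shows the general shape of searching for and reserving training plan capacity. It assumes the SearchTrainingPlanOfferings and CreateTrainingPlan APIs and expresses capacity in terms of ml.p6e-gb200.36xlarge compute nodes; exactly how UltraServer types (ml.u-p6e-gb200x36 or ml.u-p6e-gb200x72) map onto these parameters may differ, so verify against the current SageMaker API reference.

```python
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# Minimal sketch: find training plan offerings for HyperPod and reserve one.
# Parameter names follow the standard training plan APIs; values are placeholders.
offerings = sm.search_training_plan_offerings(
    InstanceType="ml.p6e-gb200.36xlarge",   # UltraServer compute node type
    InstanceCount=9,                        # e.g. one ml.u-p6e-gb200x36 rack
    DurationHours=720,                      # desired reservation length
    TargetResources=["hyperpod-cluster"],
)

# Reserve the first matching offering (in practice, compare dates and pricing).
offering_id = offerings["TrainingPlanOfferings"][0]["TrainingPlanOfferingId"]
plan = sm.create_training_plan(
    TrainingPlanName="ultraserver-plan",
    TrainingPlanOfferingId=offering_id,
)
print(plan["TrainingPlanArn"])
```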

After finding the training plan that fits your needs, we recommend configuring at least one spare ml.p6e-gb200.36xlarge compute node so that faulty instances can be quickly replaced with minimal disruption.

Create an UltraServer cluster with SageMaker HyperPod

After purchasing an UltraServer training plan, you can add the capacity to an ml.p6e-gb200.36xlarge type instance group within your SageMaker HyperPod cluster and specify the quantity of instances that you want to provision up to the amount available within the training plan. For example, if you purchased a training plan for one ml.u-p6e-gb200x36 UltraServer, you could provision up to 9 compute nodes, whereas if you purchased a training plan for one ml.u-p6e-gb200x72 UltraServer, you could provision up to 18 compute nodes.
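As a rough sketch of this step, the snippet below creates a HyperPod cluster with an ml.p6e-gb200.36xlarge instance group backed by a purchased training plan using boto3. The cluster name, ARNs, lifecycle script location, and instance count are placeholders, and the exact CreateCluster fields (including where TrainingPlanArn is specified) should be verified against the current SageMaker API reference.

```python
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# Minimal sketch: create an EKS-orchestrated HyperPod cluster with an
# UltraServer-backed instance group drawn from a purchased training plan.
# All names, ARNs, and the S3 lifecycle script location are placeholders.
response = sm.create_cluster(
    ClusterName="ultraserver-cluster",
    Orchestrator={"Eks": {"ClusterArn": "arn:aws:eks:us-east-1:111122223333:cluster/my-eks-cluster"}},
    VpcConfig={
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0"],
    },
    InstanceGroups=[
        {
            "InstanceGroupName": "gb200-worker-group",
            "InstanceType": "ml.p6e-gb200.36xlarge",
            "InstanceCount": 9,  # up to the capacity available in your training plan
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/hyperpod-lifecycle/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
            "TrainingPlanArn": "arn:aws:sagemaker:us-east-1:111122223333:training-plan/ultraserver-plan",
        }
    ],
)
print(response["ClusterArn"])
```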

By default, SageMaker will optimize the placement of instance group nodes within the same UltraServer so that GPUs across nodes are interconnected within the same NVLink domain to achieve the best data transfer performance for your jobs. For example, if you purchase two ml.u-p6e-gb200x72 UltraServers with 17 compute nodes available each (assuming you configured two spares), then create an instance group with 24 nodes, the first 17 compute nodes will be placed on UltraServer A, and the other 7 compute nodes will be placed on UltraServer B.

Conclusion

P6e-GB200 UltraServers help organizations train, fine-tune, and serve the world’s most ambitious AI models at scale. By combining extraordinary GPU resources, ultrafast networking, and industry-leading memory with the automation and scalability of SageMaker HyperPod, enterprises can accelerate the different stages of the AI lifecycle, from experimentation and distributed training through seamless inference and deployment. This powerful solution breaks new ground in performance and flexibility and reduces operational complexity and costs, so that innovators can unlock new possibilities and lead the next era of AI advancement.


About the authors

Nathan Arnold is a Senior AI/ML Specialist Solutions Architect at AWS based out of Austin Texas. He helps AWS customers—from small startups to large enterprises—train and deploy foundation models efficiently on AWS. When he’s not working with customers, he enjoys hiking, trail running, and playing with his dogs.
