NVIDIA Developer · 16 hours ago
Optimizing AI Inference Performance with Large-Scale Expert Parallelism

 

Modern AI workloads have moved beyond single-GPU inference serving; model parallelism, which efficiently splits computation across many GPUs, is now the foundation of scalable deployment. Mixture-of-experts (MoE) architectures are more efficient because they activate only a subset of parameters, but scaling MoE requires carefully optimized parallelism, communication, and scheduling. Expert parallelism (EP) addresses these challenges by distributing experts across GPUs; models such as DeepSeek-R1 call for large-scale EP across eight or more GPUs. NVIDIA TensorRT-LLM's Wide-EP tackles compute and memory bottlenecks algorithmically and, combined with the GB200 NVL72 system, enables efficient MoE inference. The technique reduces weight-loading pressure, optimizes communication, and raises GPU utilization through dynamic load balancing, significantly increasing throughput and lowering cost.

🔹 Expert parallelism (EP) distributes an MoE model's experts across GPUs, using their combined compute and memory bandwidth to scale efficiently. Large-scale EP spreads experts across eight or more GPUs, increasing aggregate bandwidth to speed up weight loading and support larger effective batch sizes, which addresses the memory pressure and utilization limits of small-scale EP.

📊 MoE models dynamically load only the weights of activated experts, sharply reducing per-token compute, but in high-throughput scenarios weight loading becomes the main bottleneck for MoE GroupGEMMs (grouped matrix multiplications that batch tokens per expert). Large-scale EP reduces the number of experts held by each GPU, relieving weight-loading pressure and raising GroupGEMM arithmetic intensity and weight reuse.

🚀 The GB200 NVL72 system supplies 130 TB/s of aggregate NVLink bandwidth, offsetting the communication overhead of distributed experts in large-scale EP. Custom EP communication kernels maintain CUDA graph compatibility, handle non-static data sizes, and exploit the NVL72's aggregate memory, making the token-gather and reordering operations around MoE GroupGEMMs practical.

⚙️ Wide Expert Parallelism (Wide-EP) pairs NVIDIA Dynamo (a disaggregated inference orchestration layer) with TensorRT-LLM (the expert-parallel decode engine). Dynamo schedules the prefill and decode phases across GPU pools, while Wide-EP assigns a small number of experts to each GPU to optimize memory and compute. Together they deliver SLA-aware autoscaling, real-time traffic adaptation, and hardware synergy, raising throughput while lowering latency.

📈 The performance and cost benefits of large-scale EP depend on model size, number of experts, system latency, and hardware capability. Large MoE models such as DeepSeek-R1 are ideal Wide-EP candidates, reaching up to 1.8x higher per-GPU throughput on GB200 NVL72. Dynamic load balancing via the Expert Parallel Load Balancer (EPLB) prevents "hot experts" from concentrating on a single GPU, supporting static or online modes that optimize allocation in real time and further maximize utilization.

Modern AI workloads have moved well beyond single-GPU inference serving. Model parallelism, which efficiently splits computation across many GPUs, is now the foundation of scalable, state-of-the-art deployments. The highest-performing models increasingly adopt mixture-of-experts (MoE) architectures, which are more efficient than dense models because they activate only a subset of trained parameters per token. However, scaling MoEs introduces more complex parallelism, communication, and scheduling requirements that must be carefully optimized.

Expert parallelism (EP), the strategic distribution of experts across multiple GPUs, is essential to overcoming these challenges and unlocking scalable performance. As models like DeepSeek-R1, with 256 experts and 671 billion parameters, continue to grow, new tools are needed, such as NVIDIA TensorRT-LLM's Wide Expert Parallelism (Wide-EP), which makes large-scale deployment more efficient and improves both performance and total cost of ownership.

In this blog, we break down how large-scale EP impacts performance and reshapes inference economics in the NVL72 rack-scale domain.

How to achieve large-scale expert parallelism

Expert parallelism (EP) is a model-parallel technique that distributes an MoE model's experts across multiple GPUs to take advantage of combined compute and memory bandwidth. At smaller scales, EP helps reduce memory pressure and keep utilization high by balancing work across devices.

Figure 1. Animation showing how small-scale EP deploys many experts per GPU, while large-scale EP spreads fewer experts per GPU across a much larger cluster, enabling efficient scaling of MoE layers.

As models like DeepSeek-R1 grow to hundreds of billions of parameters with hundreds of experts, these same techniques must expand in scope, leading to what we call large-scale EP. For the purposes of this blog, large-scale EP refers to the process of distributing experts across eight or more GPUs. This increases aggregated bandwidth for faster weight loading and supports larger effective batch sizes to improve overall GPU utilization.
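To make the contrast concrete, here is a minimal Python sketch (illustrative only, not TensorRT-LLM code) that assigns one MoE layer's experts to GPUs round-robin and reports how many experts each GPU holds at different EP degrees. The 256-expert count follows the DeepSeek-R1 example; the placement scheme itself is an assumption for illustration.

```python
# Illustrative sketch: how expert-parallel (EP) degree changes the number of
# experts each GPU must hold for one MoE layer. Not TensorRT-LLM code.

def place_experts(num_experts, ep_size):
    """Round-robin assignment of expert IDs to GPUs for a single MoE layer."""
    placement = {gpu: [] for gpu in range(ep_size)}
    for expert_id in range(num_experts):
        placement[expert_id % ep_size].append(expert_id)
    return placement

NUM_EXPERTS = 256  # e.g. DeepSeek-R1 routed experts per MoE layer

for ep_size in (4, 8, 32, 64):
    placement = place_experts(NUM_EXPERTS, ep_size)
    print(f"EP={ep_size:>2}: {len(placement[0])} experts per GPU per layer")

# EP= 4: 64 experts per GPU per layer   (small-scale EP)
# EP= 8: 32 experts per GPU per layer
# EP=32: 8 experts per GPU per layer
# EP=64: 4 experts per GPU per layer    (large-scale EP: far less weight per GPU)
```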

What are the memory and compute challenges of large-scale EP?

MoE models provide the added benefit of activating only a small subset of experts during inference, significantly reducing the per-token compute requirement. To achieve this, MoEs dynamically load the weights of an activated expert on a per-token, per-layer basis. In high-throughput, latency-constrained scenarios, weight-loading overhead can quickly become a major bottleneck for a specific type of compute process called MoE GroupGEMMs.
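The routing that drives this dynamic weight loading can be sketched in a few lines of Python. The snippet below is a generic top-k selection over random stand-in logits, not DeepSeek-R1's actual gating function; the top_k=8 and 256-expert values mirror the example used throughout this post.

```python
# Minimal sketch of MoE top-k routing: each token touches only top_k experts,
# and therefore only their weights. Generic illustration, not a real gate.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_experts, top_k = 16, 256, 8

router_logits = rng.standard_normal((num_tokens, num_experts))  # stand-in for a learned gate
topk_ids = np.argsort(router_logits, axis=-1)[:, -top_k:]       # top_k experts per token

active_experts = np.unique(topk_ids)
print(f"{len(active_experts)}/{num_experts} experts touched by {num_tokens} tokens")
# Each individual token computes with only top_k/num_experts = 8/256 ≈ 3% of the layer's experts,
# but across a large batch most experts end up activated and their weights must be loaded.
```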

MoE GroupGEMMs are like sending all tokens to the same checkout lane at the same time, so they can be processed in one efficient batch. In practice, they are grouped matrix multiplications that batch tokens per expert into a single large calculation. That boosts arithmetic intensity, but it requires loading each expert’s weights into on-chip memory/registers before multiplication.
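A minimal numpy sketch of that grouping step follows, with toy sizes chosen for readability rather than realism; production kernels run this as a single fused GroupGEMM on the GPU. It also prints a rough arithmetic-intensity figure (FLOPs per byte of expert weight loaded) to show how intensity grows with the number of tokens grouped per expert.

```python
# Sketch of a grouped GEMM for one MoE layer: batch all tokens routed to the
# same expert into one matrix multiply. Toy sizes; real kernels run fused on-GPU.
import numpy as np

rng = np.random.default_rng(0)
hidden, ffn, num_experts, num_tokens, top_k = 128, 256, 8, 64, 2

tokens = rng.standard_normal((num_tokens, hidden)).astype(np.float32)
expert_w = rng.standard_normal((num_experts, hidden, ffn)).astype(np.float32)
topk_ids = np.argsort(rng.standard_normal((num_tokens, num_experts)), axis=-1)[:, -top_k:]

outputs = np.zeros((num_tokens, top_k, ffn), dtype=np.float32)
for e in range(num_experts):
    tok_idx, slot = np.nonzero(topk_ids == e)       # all (token, slot) pairs routed to expert e
    if tok_idx.size == 0:
        continue                                    # expert not activated: weights never loaded
    group = tokens[tok_idx]                         # gather -> one contiguous batch
    outputs[tok_idx, slot] = group @ expert_w[e]    # one GEMM per expert ("GroupGEMM")

    # Arithmetic intensity for this group: FLOPs per byte of expert weight loaded.
    flops = 2 * group.shape[0] * hidden * ffn
    print(f"expert {e}: {group.shape[0]:2d} tokens, {flops / expert_w[e].nbytes:.1f} FLOPs/weight-byte")
# More tokens grouped per expert -> more FLOPs per byte of weights loaded (higher reuse).
```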

Large-scale EP addresses some of the MoE GroupGEMM bottlenecks by introducing more GPUs into the expert parallel configuration, efficiently reducing the number of experts held by each GPU. This results in:

- Less weight-loading pressure (smaller set of expert weights per GPU)
- Easier reuse of weights by the GroupGEMM kernel (higher arithmetic intensity: more FLOPs per byte of weight loaded)
- Better compute/memory balance inside the kernel
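As a back-of-the-envelope illustration of the first two points, the sketch below assumes 256 routed experts and roughly 44 MB of weights per expert (an assumed, illustrative figure, not DeepSeek-R1's exact expert size) and shows how the per-GPU expert-weight footprint shrinks as the EP degree grows.

```python
# Back-of-the-envelope: per-GPU expert-weight footprint for one MoE layer as the
# EP degree grows. Illustrative sizes; not measured DeepSeek-R1 figures.
NUM_EXPERTS = 256
BYTES_PER_EXPERT = 44e6  # assumed ~44 MB of expert weights (illustrative)

for ep_size in (8, 16, 32, 64):
    experts_per_gpu = NUM_EXPERTS // ep_size
    per_gpu_mb = experts_per_gpu * BYTES_PER_EXPERT / 1e6
    print(f"EP={ep_size:>2}: {experts_per_gpu:>2} experts/GPU, "
          f"~{per_gpu_mb:,.0f} MB of expert weights streamed per layer")
# Fewer experts per GPU means less weight traffic per decode step and, at a fixed
# global batch, more tokens gathered per resident expert (higher weight reuse).
```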

While large-scale EP helps address the limitations of small-scale EP, it also introduces new system-level constraints that make scaling large MoEs difficult. TensorRT-LLM Wide-EP helps address these constraints by targeting compute and memory bottlenecks algorithmically while also tackling workload management at the system and architecture level. 

Let’s examine how Wide-EP, when paired with GB200 NVL72, provides the foundation for scalable and efficient MoE inference.

What’s the system design and architecture?

Scaling expert parallelism requires more than adding GPUs. It depends on system design and architecture that keep memory movement and communication efficient. Interconnect bandwidth and topology provide the foundation, allowing activations and weights to flow smoothly across devices. 

On top of this, optimized software and kernels manage expert-to-expert traffic with communication primitives, bandwidth-aware scheduling, and load balancing. Together, these capabilities make large-scale EP practical and efficient.

One of the biggest bottlenecks in large-scale EP is communication overhead. During the decode phase of inference, distributed experts must exchange information to consolidate the outputs of multiple GPUs across the system. For instance, when distributing DeepSeek-R1’s 256 experts across 64 GPUs with eight active experts per token (See Figure 3 below), the communication cost depends on which experts are activated at a given layer and where their weights are located.

Figure 3. Schematic diagram showing an MoE deployment with 232 experts per GPU and only four activated per layer, coordinated across 72 GPUs in a GB200 NVL72 NVLink domain.

While large-scale EP reduces weight-loading overhead for activated experts, these gains can be offset by token-gather collectives that must consolidate distributed outputs and reorder tokens before passing them to the next transformer block or the final softmax layer. Without the 130 TB/s of aggregate bandwidth provided by the NVL72, the complexity and overhead of this communication pattern would make large-scale EP impractical.
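A rough estimate of that communication volume helps explain why aggregate bandwidth is the deciding factor. The sketch below assumes DeepSeek-R1-like dimensions (hidden size 7168, eight routed experts per token) and a hypothetical decode batch; the figures are illustrative, not measured TensorRT-LLM results.

```python
# Rough estimate of token dispatch/combine traffic for one MoE layer during decode.
# Assumed, illustrative values; not measured TensorRT-LLM figures.
hidden_size = 7168          # DeepSeek-R1 hidden dimension
topk = 8                    # routed experts activated per token
bytes_per_elem = 2          # activations in FP16/BF16
tokens_in_flight = 32 * 64  # hypothetical: 32 concurrent requests per GPU at EP=64

# Each token's hidden state is sent to (up to) topk expert GPUs and gathered back.
dispatch_bytes = tokens_in_flight * topk * hidden_size * bytes_per_elem
print(f"~{2 * dispatch_bytes / 1e9:.2f} GB moved per MoE layer per decode step")

# At NVL72's ~130 TB/s aggregate NVLink bandwidth this is only microseconds of
# transfer time; over a slower interconnect the same pattern would dominate decode.
aggregate_bw = 130e12  # bytes/s
print(f"~{2 * dispatch_bytes / aggregate_bw * 1e6:.1f} µs of aggregate-bandwidth time")
```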

Optimizing kernels for optimal expert routing with NCCL

MoEs leverage a routing mechanism to dynamically select the most appropriate experts per token. This means that every transformer block requires per-token dispatch to the selected experts and aggregation of their outputs after tokens pass through the expert layers. The all-to-all operations involved can quickly saturate an already memory-bound decode phase.
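The dispatch-and-combine round trip can be mocked up single-process to show why the data sizes are non-static: how many tokens each GPU receives depends entirely on what the router chose that step. The sketch below computes the per-destination split sizes and the permutation needed to restore token order afterwards; in a real deployment these splits would feed an all-to-all collective, and the contiguous expert placement is an assumption for illustration.

```python
# Single-process mock of MoE token dispatch/combine. Shows why the all-to-all
# split sizes change every step: they depend on the router's choices.
import numpy as np

rng = np.random.default_rng(1)
num_tokens, num_experts, top_k, ep_size = 12, 16, 2, 4
experts_per_gpu = num_experts // ep_size             # contiguous placement: experts 0-3 on GPU 0, etc.

topk_ids = np.argsort(rng.standard_normal((num_tokens, num_experts)), axis=-1)[:, -top_k:]
dest_gpu = topk_ids // experts_per_gpu               # which GPU hosts each chosen expert

# Dispatch: sort (token, slot) pairs by destination GPU -> variable split sizes.
flat_dest = dest_gpu.ravel()
order = np.argsort(flat_dest, kind="stable")
split_sizes = np.bincount(flat_dest, minlength=ep_size)
print("token copies sent to each GPU this step:", split_sizes)   # changes every decode step

# Combine: invert the permutation to restore the original token order after the
# expert outputs come back.
inverse = np.empty_like(order)
inverse[order] = np.arange(order.size)
assert np.array_equal(flat_dest[order][inverse], flat_dest)       # round trip preserves order
```

The same bookkeeping is what the custom kernels described next must perform directly on the GPU, without a host round trip, which is why non-static sizes and CUDA graph compatibility matter.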

To address these challenges, custom EP communication kernels are required. For GB200 NVL72, we have implemented custom kernels to address CUDA graph compatibility with multiple rack-scale deployment scenarios. Of note are custom high-performance NCCL kernels designed to handle non-static data sizes across large-scale EP deployments. These custom EP kernels are able to accept communication sizes directly from GPU memory and take advantage of the NVL72 aggregate memory. 

Load balancing wide experts

Load balancing is a classic distributed systems technique that assigns work based on resource availability to maximize utilization without overloading any single part of the system. In the case of large-scale EP workloads, load balancing is used to distribute experts among the available GPUs. For example, in a GB200 NVL72 rack running Wide-EP DeepSeek-R1 with EP=64 (for clean division), we would distribute four experts per GPU per layer, for a total of 232 experts assigned per GPU (four for each of the model's 58 MoE layers).

To prevent load-balancing scenarios where a collection of very popular “hot experts” all sit on the same GPU while other GPUs with less popular “cold experts” sit idle, Wide-EP’s Expert Parallel Load Balancer (EPLB) leverages a policy to redistribute hot experts alongside cold experts. This triggers a weight update process, addressed by using a containerized design that allows experts to flow in and out of container allocations without breaking the CUDA graph. These weight updates are performed in a non-blocking fashion by scheduling them between forward passes.

Figure 4. Diagram showing how the Expert Parallel Load Balancer (EPLB) redistributes experts to ensure a balanced GPU workload, preventing over- and under-utilization.
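A simplified version of this redistribution idea can be written as a greedy, capacity-constrained placement over observed per-expert token counts. The sketch below is an illustrative policy, not the actual EPLB algorithm; the 256-expert, EP=64, four-slots-per-GPU numbers follow the example above, and the real implementation additionally handles expert replication and non-blocking weight movement.

```python
# Greedy sketch of expert load balancing: place "hot" experts first, each onto
# the currently least-loaded GPU with free slots, so popular experts don't pile
# up on one device. Illustrative policy only; not the EPLB algorithm.
import heapq
import numpy as np

rng = np.random.default_rng(0)
num_experts, ep_size = 256, 64
slots_per_gpu = num_experts // ep_size                 # 4 experts per GPU per layer

# Observed per-expert token counts, skewed so some experts are much hotter.
expert_load = rng.pareto(3.0, num_experts) + 1.0

# Min-heap of (accumulated load, gpu_id, free slots); assign heaviest experts first.
heap = [(0.0, gpu, slots_per_gpu) for gpu in range(ep_size)]
heapq.heapify(heap)
assignment = {}
for expert in np.argsort(-expert_load):
    load, gpu, free = heapq.heappop(heap)
    assignment[int(expert)] = gpu
    if free > 1:
        heapq.heappush(heap, (load + expert_load[expert], gpu, free - 1))

def imbalance(assign):
    """Ratio of the busiest GPU's expert load to the average GPU load."""
    load = np.zeros(ep_size)
    for e, g in assign.items():
        load[g] += expert_load[e]
    return load.max() / load.mean()

naive = {e: e // slots_per_gpu for e in range(num_experts)}   # load-unaware contiguous placement
print(f"imbalance (max/mean GPU load): naive {imbalance(naive):.2f}x -> balanced {imbalance(assignment):.2f}x")
```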

The EPLB can operate in two different modes: 

- Static EPLB: Pre-computed expert-to-GPU mappings, based on historical data patterns, are used to optimize expert allocation.
- Online EPLB: Experts are redistributed dynamically at runtime to adapt in real time to changing workload patterns.

While static EPLB offers good baseline improvements over a non-EPLB approach, online EPLB provides the highest potential for optimal load balancing in real-time production systems. In our initial implementation of online EPLB, we encountered and resolved several critical challenges associated with real-time weight updates.

Wide-EP with TensorRT-LLM and NVIDIA Dynamo

When deploying MoE models like DeepSeek R1 or Llama 4 at scale, inference performance hinges on two key pillars: disaggregated serving and Wide-EP. NVIDIA Dynamo and TensorRT-LLM form the software backbone that enables both, transforming traditional bottlenecks into opportunities for massive throughput gains and efficient GPU utilization. The table below outlines the differences and synergies between Dynamo and Wide-EP.

| Component | NVIDIA Dynamo | TensorRT-LLM Wide-EP |
| --- | --- | --- |
| Role | Orchestration layer for disaggregated inference | Execution engine for expert-parallel decoding |
| Optimization Scope | Orchestrates prefill & decode phases across GPU pools | Distributes a small number of experts per GPU to optimize per-token memory and compute utilization |
| SLA Awareness | SLA-aware autoscaling and dynamic rate matching (TTFT & ITL) | Maximizes batching & minimizes latency through efficient expert scheduling |
| Traffic Adaptation | Reacts in real time to ISL/OSL fluctuations via the Dynamo Planner | Load balances expert allocations to optimize compute utilization |
| Hardware Synergy | Scales via Kubernetes + Planner logic across disaggregated GPU domains | Leverages high-bandwidth domains (e.g., NVL72) for efficient expert communication |

Table 1. Comparison of NVIDIA Dynamo and TensorRT-LLM Wide-EP for expert-parallel inference, highlighting roles, optimization scope, SLA awareness, traffic adaptation, and hardware synergy.

For more insights into the relationships between NVIDIA Dynamo and TensorRT-LLM Wide-EP, we encourage you to review our blog on leveraging NVIDIA Dynamo for large-scale expert parallelism. 

What are the performance and workload economics?

When you have access to the coherent memory domain created by NVLink scale-up in a GB200 NVL72 rack, optimizing large-scale EP comes down to a few critical factors:

- Model size and number of experts: Smaller models with fewer experts gain less from Wide-EP because communication overhead can outweigh the benefits of reduced weight loading and distributed compute.
- System latency and concurrency goals: Large-scale EP is most effective when throughput is constrained by latency, allowing for greater per-GPU throughput at iso-latency.
- Hardware capabilities: Aggregate memory bandwidth, inter-GPU bandwidth, and achievable compute determine whether the system can reach the optimal degree of parallelism.

In practice, models like DeepSeek-R1 are strong candidates for large-scale EP, where TensorRT-LLM’s Wide-EP on GB200 NVL72 rack-scale systems delivers the best balance of efficiency and throughput. The Pareto frontiers below highlight performance across different EP configurations.

Figure 5. Large-scale Expert Parallelism (EP) rank 32 delivers up to 1.8x higher output token throughput per GPU compared to small EP rank 8 at 100 tokens/sec per user. Both configurations leverage disaggregated serving and multi-token prediction (MTP).

Compared to the small EP configuration (EP8), the large EP configuration (EP32) achieves up to 1.8x more per-GPU throughput. This highlights the performance uplift opportunity from leveraging large-scale EP and Wide-EP. An additional opportunity exists to leverage speculative decoding with multi-token prediction (MTP) to boost per-user token throughput—this functionality is already compatible with Wide-EP.

Summary

Wide-EP on GB200 NVL72 provides a practical path to scaling large MoE models. Distributing experts across more GPUs reduces weight-loading pressure, improves GroupGEMM efficiency, and leverages GB200 NVL72’s 130 TB/s coherent NVLink domain to offset communication overhead. In testing, large EP configurations reached up to 1.8x higher per-GPU throughput than smaller EP setups. These gains shift the balance of throughput, latency, and utilization in favor of more efficient large-scale inference.

The broader impact is on system economics. By enabling higher concurrency and stronger GPU efficiency, Wide-EP on NVL72 improves tokens/second/GPU and lowers the overall cost of serving large models. For developers, this means exploring Wide-EP in TensorRT-LLM to find optimal configurations. For researchers, it creates room to refine scheduling, load balancing, and decoding strategies. For infrastructure teams, it highlights how GB200 NVL72 can change the TCO profile of trillion-parameter deployments.

For more, check out how large-scale EP with GB200 NVL72 delivered the lowest TCO of any system architecture in the latest InferenceMAX benchmarks.

And for up-to-date performance insights, check out the NVIDIA Inference Performance dashboard.
