NVIDIA Developer · February 16
New Scaling Algorithm and Initialization with NVIDIA Collective Communications Library 2.23

NVIDIA NCCL 2.23 brings significant optimizations for multi-GPU and multi-node communication. The release introduces the PAT algorithm, which accelerates ReduceScatter and AllGather and achieves logarithmic scaling. It also improves initialization performance, supports bootstrap communication over the in-band network, and adds the ncclCommInitRankScalable API to speed up initialization at large scale. In addition, the release supports intranode user buffer registration, which reduces data copies and improves communication performance. Finally, the new profiler plugin API makes it easier to monitor and diagnose performance anomalies. Together, these features improve the efficiency of parallel computing in AI and HPC applications.

🚀 PAT algorithm: For ReduceScatter and AllGather, the release introduces the Parallel Aggregated Trees (PAT) algorithm, based on the Bruck algorithm, achieving logarithmic scaling. It is especially suited to large language model (LLM) training, where the data-parallel dimension is not fully aligned with intranode connectivity.

⚡️ Accelerated initialization: Overall initialization performance is significantly improved by eliminating some bootstrap collectives and tuning the bootstrap step. The release also allows out-of-band communication over the high-speed network (IB/RoCE) to further speed up initialization; this must be enabled manually with `NCCL_OOB_NET_ENABLE=1`.

🗂️ Intranode user buffer registration: NCCL now supports intranode user buffer (UB) registration, allowing it to access user-registered buffers directly. This avoids extra memory copies, reduces pressure on the memory subsystem, improves NCCL communication performance, and improves the overlap of computation and communication.

🛠️ Profiler plugin API: A new profiler plugin API helps users monitor and diagnose performance anomalies in large GPU clusters. The API is designed to be easy to adopt by deep learning framework profilers such as PyTorch Kineto; plugin loading and initialization are controlled with the `NCCL_PROFILER_PLUGIN` environment variable, and events are exposed in a detailed hierarchy.

The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multinode communication primitives optimized for NVIDIA GPUs and networking. NCCL is a central piece of software for multi-GPU deep learning training. It handles any kind of inter-GPU communication, be it over PCI, NVLink, or networking. It uses advanced topology detection, optimized communication graphs, and tuning models to get the best performance straight out of the box on NVIDIA GPU platforms.

In this post, we discuss the new features and fixes released in NCCL 2.23. Check out the NVIDIA/nccl GitHub repo.

Release highlights and features

NVIDIA Magnum IO NCCL is a library designed to optimize inter-GPU and multinode communication, crucial for efficient parallel computing in AI and high-performance computing (HPC) applications. The value of this release lies in its new features:

New PAT algorithm for ReduceScatter and AllGather: We introduce the Parallel Aggregated Trees (PAT) algorithm, based on the Bruck algorithm, for AllGather and ReduceScatter, achieving logarithmic scaling.
Accelerated initialization: Improved initialization performance, including the ability to use in-band networking for bootstrap communication.
ncclCommInitRankScalable: A new initialization API for using multiple ncclUniqueIds to speed up initialization at large scales.
Intranode user buffer registration: Take advantage of registered user buffers for intranode operations.
New profiler plugin API: API hooks to measure fine-grain NCCL performance.

The following sections dive deeper into the details of the new features:

PAT logarithmic scaling for ReduceScatter and AllGather
New ncclCommInitRankScalable API
Accelerated bootstrap operations
Intranode user buffer registration
New profiler plugin API
Bug fixes and minor features

PAT logarithmic scaling for ReduceScatter and AllGather

The PAT algorithm is a variation of the Bruck algorithm. It features a logarithmic number of network steps for small sizes at scale, progressively increasing the number of network transfers as sizes increase to keep buffering needs minimal. It applies to both AllGather and ReduceScatter. You can expect small to medium message sizes to perform better with PAT, with this improvement increasing as your workload scales.

The algorithm executes a binomial tree shifted for each rank. Its advantage over similar algorithms, such as recursive doubling, is that it works on any number of ranks and does not require a power of two. Initially, PAT only supports one GPU per node. The case of one GPU per node for ReduceScatter and AllGather is important for large language model (LLM) training, where pipeline parallelism and tensor parallelism are in dimensions orthogonal to data parallelism. The tensor parallelism dimension is usually aligned to the intranode NVLink connectivity, meaning that the other dimensions will only have one GPU per node. Look for our forthcoming paper describing the details of the algorithm.

New ncclCommInitRankScalable API

This feature adds a new initialization function, ncclCommInitRankScalable, to enable leveraging multiple unique IDs during communicator creation. This avoids the all-to-one communication pattern during initialization and provides more scalable initialization performance.

At communicator creation, NCCL needs to obtain the addresses of all the communicator's ranks (the bootstrap step). To do so, NCCL relies on a unique ID known to all the ranks. During the bootstrap step of the communicator initialization, each rank exchanges its address with the rank that owns the unique ID, creating an all-to-one communication pattern and a significant bottleneck at scale.

With ncclCommInitRankScalable, you are now free to provide more than one unique ID to be used during bootstrap. NCCL spreads the load across the unique IDs, enabling a constant bootstrap time at scale if the number of unique IDs provided scales with the size of the communicator. This new API requires multiple ranks to create unique IDs. To obtain the best performance, we recommend spreading the unique IDs as homogeneously as possible among the ranks.
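As a rough illustration, the sketch below creates one unique ID per node and passes all of them to ncclCommInitRankScalable. The use of MPI as the out-of-band exchange mechanism and the one-ID-per-node split are assumptions of this example, not requirements of NCCL; verify the exact function signature against nccl.h and the NCCL documentation before relying on it.

// Sketch: bootstrap a communicator with one ncclUniqueId per node.
// MPI is assumed for out-of-band exchange; error handling is omitted.
#include <stdlib.h>
#include <mpi.h>
#include <nccl.h>

ncclComm_t init_comm_scalable(MPI_Comm world, int ranks_per_node) {
  int rank, nranks;
  MPI_Comm_rank(world, &rank);
  MPI_Comm_size(world, &nranks);

  int nIds = nranks / ranks_per_node;   // one unique ID per node (assumption)
  ncclUniqueId* ids = malloc(nIds * sizeof(ncclUniqueId));

  // The first rank of each node generates an ID ...
  ncclUniqueId localId;
  if (rank % ranks_per_node == 0) ncclGetUniqueId(&localId);

  // ... the node leaders gather all IDs, then broadcast them to every rank.
  MPI_Comm leaders;
  MPI_Comm_split(world, rank % ranks_per_node == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);
  if (leaders != MPI_COMM_NULL)
    MPI_Allgather(&localId, sizeof(localId), MPI_BYTE,
                  ids, sizeof(localId), MPI_BYTE, leaders);
  MPI_Bcast(ids, nIds * sizeof(ncclUniqueId), MPI_BYTE, 0, world);

  // Every rank passes the full set of IDs; NCCL spreads the bootstrap load across them.
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  ncclComm_t comm;
  ncclCommInitRankScalable(&comm, nranks, rank, nIds, ids, &config);
  free(ids);
  return comm;
}

One ID per node is only one possible split; the guidance above is simply to spread the unique IDs as homogeneously as possible among the ranks.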
Accelerated bootstrap operations

In the 2.23 release, we improved the overall performance of the initialization code. We eliminated some of the bootstrap collectives and tuned the performance of the bootstrap step.

You can now use the fast network (IB/RoCE/…) for out-of-band communication to speed up the two linear steps of the initialization: bootstrap and allgather. This feature is disabled by default to avoid using wrongly configured devices (the use of ncclNet devices happens before topology detection). You can enable it with NCCL_OOB_NET_ENABLE=1. Additionally, you can specify which interface should be used with NCCL_OOB_NET_IFNAME. By default, NCCL will use the first ncclNet device found on that network.

Intranode user buffer registration

NCCL never requires you, as the user, to register and maintain any persistent buffers to function. This is great for ease of use, but it comes with performance tradeoffs. Without direct access, more control flow and buffering must occur when NCCL transfers data. This consumes more GPU resources and results in higher overheads for moving the same amount of data, compared to explicitly registered and mapped buffers.

Whenever possible, NCCL developers are advised to register their buffers using ncclCommRegister to allow NCCL to use all available optimizations. The NCCL team is always working to add more use cases for registered user buffers. The 2.23 release implements intranode user buffer (UB) registration support for NVLink and PCIe P2P transports.

The main benefit of intranode UB registration is avoiding extra copies among peers. This reduces pressure on the memory subsystem, improves NCCL communication performance, and improves the overlap of computation and communication. All NCCL collectives and sendrecv-based operations are supported, except for ncclReduce and ncclReduceScatter (they would not benefit).

There are two ways to enable intranode UB registration. The first is registering buffers explicitly through ncclCommRegister; the buffers are used as registered buffers only when the corresponding NCCL collectives are called. The second is capturing NCCL operations with CUDA Graphs; all user buffers are registered automatically during graph capture. For more guidelines and requirements, refer to the NCCL documentation. In addition to intranode communication over NVLink and PCIe, the feature works on multinode NVLink (MNNVL) systems within each NVLink domain.
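A minimal sketch of both registration paths follows. The buffer sizes and the choice of AllGather are arbitrary illustrations; consult the NCCL documentation for the allocation and alignment requirements that registered buffers must meet.

// Sketch: two ways to let NCCL use registered user buffers intranode.
#include <cuda_runtime.h>
#include <nccl.h>

void allgather_with_registration(ncclComm_t comm, cudaStream_t stream,
                                 float* sendbuf, float* recvbuf,
                                 size_t sendcount, int nranks) {
  // Path 1: explicit registration. NCCL uses the registered buffers
  // whenever a collective is later called on them.
  void *sendHandle, *recvHandle;
  ncclCommRegister(comm, sendbuf, sendcount * sizeof(float), &sendHandle);
  ncclCommRegister(comm, recvbuf, sendcount * nranks * sizeof(float), &recvHandle);

  ncclAllGather(sendbuf, recvbuf, sendcount, ncclFloat, comm, stream);
  cudaStreamSynchronize(stream);

  ncclCommDeregister(comm, sendHandle);
  ncclCommDeregister(comm, recvHandle);

  // Path 2: CUDA Graph capture. Buffers used by captured NCCL operations
  // are registered automatically during graph capture.
  cudaGraph_t graph;
  cudaGraphExec_t graphExec;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  ncclAllGather(sendbuf, recvbuf, sendcount, ncclFloat, comm, stream);
  cudaStreamEndCapture(stream, &graph);
  cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
  cudaGraphLaunch(graphExec, stream);
  cudaStreamSynchronize(stream);
  cudaGraphExecDestroy(graphExec);
  cudaGraphDestroy(graph);
}

The graph-capture path is attractive for training loops that already replay CUDA Graphs, since registration then requires no extra application code.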
New profiler plugin API

As GPU clusters scale up, performance anomalies become harder to detect and root-cause. Domain-specific monitoring and diagnostic tools are needed to collect and analyze telemetry data with minimal overhead on the running jobs. The NCCL profiler plugin interface has been designed to address these concerns. The interface design also makes it easy to adopt by DL framework profilers such as PyTorch Kineto.

The new NCCL_PROFILER_PLUGIN environment variable controls profiler plugin loading and initialization, in the same way other NCCL plugins are loaded and initialized. Once loaded, the profiler plugin can enable NCCL event profiling by setting the event activation mask that NCCL exposes to the profiler during initialization. The event activation mask is a 32-bit integer where every bit represents an NCCL profiler event. Currently, NCCL supports the following events:

ncclProfileGroup (bit 0): Group event
ncclProfileColl (bit 1): Collective event
ncclProfileP2p (bit 2): Point-to-point event
ncclProfileProxyOp (bit 3): Proxy progress channel event
ncclProfileProxyStep (bit 4): Proxy progress step event
ncclProfileProxyCtrl (bit 5): Proxy progress internal state event

NCCL expresses events in a hierarchical form. For example, collectives can be grouped together, and proxy operations assist the GPU with point-to-point transfer of individual data chunks across the available network communication channels. NCCL therefore presents the corresponding events to the profiler preserving this relationship. The NCCL event hierarchy is shown below:

ncclProfileGroup
|
+- ncclProfileColl
|  |
|  +- ncclProfileProxyOp
|     |
|     +- ncclProfileProxyStep
|
+- ncclProfileP2p
   |
   +- ncclProfileProxyOp
      |
      +- ncclProfileProxyStep

ncclProfileProxyCtrl

This hierarchical representation enables profiler plugins to present events to users in a more meaningful and comprehensible form. NCCL also provides an example profiler plugin in the ext-profiler/example directory that can be used as a template to develop third-party profiler plugins.

In total, the profiler plugin interface defines the following five function callbacks:

ncclResult_t (*init)(void** context, int* eActivationMask);
ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_t* eDescr);
ncclResult_t (*stopEvent)(void* eHandle);
ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_t eState, ncclProfilerEventStateArgs_t* eStateArgs);
ncclResult_t (*finalize)(void* context);

The profiler init function takes an event activation mask pointer and returns an opaque context object to NCCL. The context provides isolation between profiler instances, while the event activation mask is used by the profiler to notify NCCL which events should be profiled; for example, setting eActivationMask = ncclProfileColl | ncclProfileProxyOp.

The profiler startEvent function takes a profiler context and an event descriptor. The profiler uses the descriptor information to allocate and initialize a new event object. Afterwards, the profiler returns an opaque handle that NCCL can use to perform further operations on the event; for example, to record state updates.

The profiler stopEvent function takes an event handle and marks the event as complete. Afterwards, the event handle can no longer be used (the profiler might internally recycle the corresponding object for future events).

The profiler recordEventState function takes an event handle, an event state, and (optionally) an event state argument object. This function enables the profiler to update events that can transition through different states in NCCL. One example is proxy events, where the proxy needs to coordinate with both the GPU and the network while transferring data, moving from one state to another in the process.

The profiler finalize function takes the profiler context and releases all the resources associated with it.
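A rough sketch of a plugin built from the five callbacks above is shown below. The header name, the exported symbol name, and the descriptor struct layout are assumptions for illustration only; the ext-profiler/example directory in the NCCL repository is the authoritative template.

// Rough sketch of a profiler plugin assembled from the five callbacks above.
// Header name, symbol name, and struct layout are assumptions; see
// ext-profiler/example in the NCCL repo for the real definitions.
#include <stdlib.h>
#include "profiler.h"   // header shipped with the example plugin (assumption)

static ncclResult_t myInit(void** context, int* eActivationMask) {
  // Ask NCCL to emit collective and proxy-operation events only.
  *eActivationMask = ncclProfileColl | ncclProfileProxyOp;
  *context = calloc(1, sizeof(int));   // opaque per-instance state
  return ncclSuccess;
}

static ncclResult_t myStartEvent(void* context, void** eHandle,
                                 ncclProfilerEventDescr_t* eDescr) {
  (void)context; (void)eDescr;
  // A real plugin would allocate an event object from the descriptor here.
  *eHandle = NULL;
  return ncclSuccess;
}

static ncclResult_t myStopEvent(void* eHandle) {
  (void)eHandle;
  return ncclSuccess;
}

static ncclResult_t myRecordEventState(void* eHandle, ncclProfilerEventState_t eState,
                                       ncclProfilerEventStateArgs_t* eStateArgs) {
  (void)eHandle; (void)eState; (void)eStateArgs;
  return ncclSuccess;
}

static ncclResult_t myFinalize(void* context) {
  free(context);
  return ncclSuccess;
}

// Exported descriptor that NCCL looks up after loading the plugin library
// selected by NCCL_PROFILER_PLUGIN (symbol and name field are assumptions).
ncclProfiler_v1_t ncclProfiler_v1 = {
  .name = "example-profiler",
  .init = myInit,
  .startEvent = myStartEvent,
  .stopEvent = myStopEvent,
  .recordEventState = myRecordEventState,
  .finalize = myFinalize,
};

Once compiled into a shared library, pointing NCCL_PROFILER_PLUGIN at that library lets NCCL load and initialize it, as described above.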
Bug fixes and minor features

NCCL 2.23 provides the following additional updates:

Asynchronous graph allocation: Makes the calls to cudaMalloc and cudaMemcpy during graph allocation asynchronous, which significantly speeds up graph capture.
Use fatal IB asynchronous events to stop network operations: Helps catch link-down errors and other fatal asynchronous events within NCCL.
Set the P2P level to PXB on AMD CPUs when using more than two GPUs per node.
Improve the init logs to report the actual NCCL function: Informs the user whether NCCL is performing ncclCommInitRank or ncclCommSplit.
Add the NCCL_CONF_FILE variable.
Increase the default IB timeout from 18 to 20.
Add a new check for NVIDIA peermem that works with recent Linux kernels.
Fix an old performance regression when mixing small and large operations.
Fix a crash when NUMA IDs are equal to -1.
Fix the tree graph search when NCCL_CROSS_NIC is set to 1.

Summary

NVIDIA NCCL 2.23 introduces new features and improvements for optimizing inter-GPU and multinode communication, crucial for AI and HPC applications. Key enhancements include the new PAT algorithm, accelerated initialization at scale, intranode user buffer registration, and the new profiler plugin API.

To learn more about the previous release, see Memory Efficiency, Faster Initialization, and Cost Estimation with NVIDIA Collective Communications Library 2.22. Learn more about Magnum IO and NCCL.
