Nvidia Developer · October 7, 00:03
NVIDIA Blackwell Architecture Introduces a Hardware Decompression Engine to Accelerate Data Processing

NVIDIA has introduced a hardware Decompression Engine (DE) in the Blackwell architecture and paired it with the nvCOMP library to address the latency and compute cost that data decompression introduces. The DE accelerates common formats such as Snappy, LZ4, and Deflate, offloading decompression from general-purpose compute units and freeing GPU capacity for other tasks. The engine is integrated into the copy engine, so data can be decompressed while it is being transferred, reducing I/O bottlenecks. Through the nvCOMP API, developers can easily take advantage of the DE while keeping their code portable and gaining performance, which is especially valuable for data-intensive workloads such as training large language models, analyzing genomics data, and running HPC simulations. The article also details how to configure memory with APIs such as cudaMallocFromPoolAsync and cuMemCreate so that the DE can be used effectively, along with related best practices and performance comparisons.

💾 **Hardware-accelerated decompression**: The hardware Decompression Engine (DE) added in the NVIDIA Blackwell architecture is dedicated to accelerating decompression of formats such as Snappy, LZ4, and Deflate. By moving decompression onto dedicated hardware, the DE significantly reduces the burden on the CPU and GPU, freeing valuable compute resources for core computation and improving overall system efficiency.

🚀 **Improved I/O performance and parallelism**: The DE is integrated into the copy engine, allowing data to be decompressed on the fly as it is transferred over PCIe or C2C, eliminating the latency of the traditional transfer-then-decompress pattern and relieving I/O bottlenecks. The DE also supports multi-stream decompression running in parallel with GPU kernels, sustaining high GPU utilization, which is essential to keep Blackwell GPUs from being limited by I/O.

💡 **Simplified development and cross-platform compatibility**: Through the APIs provided by the nvCOMP library, developers can easily use the DE's capabilities while keeping their code portable. nvCOMP automatically switches between hardware decompression and an SM-based software implementation depending on whether the DE is available, ensuring compatibility across different GPUs. The article details the steps and caveats for configuring memory with APIs such as cudaMallocFromPoolAsync and cuMemCreate so the DE can work correctly.

📊 **Performance advantages and use cases**: Compared with traditional decompression on the streaming multiprocessors (SMs), the DE offers clear advantages in throughput and efficiency, especially when processing large volumes of data. This hardware-accelerated decompression is particularly important for data-intensive scenarios such as training large language models, analyzing massive genomics datasets, and running high-performance computing simulations, markedly improving processing speed while reducing cost.

Compression is a common technique to reduce storage costs and accelerate input/output transfer times across databases, data-center communications, high-performance computing, deep learning, and more. But decompressing that data often introduces latency and consumes valuable compute resources, slowing overall performance.  

To address these challenges, NVIDIA introduced the hardware Decompression Engine (DE) in the NVIDIA Blackwell architecture—and paired it with the nvCOMP library. Together, they offload decompression from general-purpose compute, accelerate widely used formats like Snappy, and make adoption seamless. 

This blog walks through how the DE and nvCOMP work, guidelines for using them, and the performance benefits they unlock for data-intensive workloads.

How the Decompression Engine works

The new DE in the Blackwell architecture is a fixed-function hardware block designed to accelerate decompression of Snappy, LZ4, and Deflate-based streams. By handling decompression in hardware, the DE frees up valuable streaming multiprocessor (SM) resources for compute, rather than burning cycles on data movement. 

Integrated as part of the copy engine, the DE eliminates the need for sequential host-to-device copies followed by software decompression. Instead, compressed data can be transferred directly across PCIe or C2C and decompressed in transit, reducing a major I/O bottleneck.

Beyond raw throughput, the DE enables true concurrency of data movement and compute. Multi-stream workloads can issue decompression operations in parallel with SM kernels, keeping the GPU fully utilized. In practice, this means data-intensive applications such as training LLMs, analyzing massive genomics datasets, or running HPC simulations can keep pace with the bandwidth of next-generation Blackwell GPUs without stalling on I/O. 
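
As a sketch of what this concurrency can look like in practice (enqueue_decompress is a hypothetical wrapper around an nvCOMP batched decompress call, and the kernel is a stand-in for real compute):

#include <cuda_runtime.h>

// Hypothetical wrapper that issues an nvCOMP batched decompress on `stream`;
// on DE-enabled GPUs, this work runs on the Decompression Engine, not the SMs.
void enqueue_decompress(cudaStream_t stream);

__global__ void compute_kernel(float* data, int n) {
    // Stand-in SM work that runs concurrently with the DE.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void overlap_decompress_and_compute(float* d_data, int n) {
    cudaStream_t io_stream, compute_stream;
    cudaStreamCreate(&io_stream);
    cudaStreamCreate(&compute_stream);

    enqueue_decompress(io_stream);                                           // DE work
    compute_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_data, n);  // SM work

    // Both streams progress concurrently; the SMs stay busy while the DE decompresses.
    cudaStreamSynchronize(io_stream);
    cudaStreamSynchronize(compute_stream);
    cudaStreamDestroy(io_stream);
    cudaStreamDestroy(compute_stream);
}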

The benefits of nvCOMP’s GPU-accelerated compression

The NVIDIA nvCOMP library provides GPU-accelerated compression and decompression routines. It supports a wide range of standard formats, along with formats that NVIDIA has optimized for the best possible GPU performance. 

In the case of standard formats, CPUs and fixed-function hardware frequently have architectural advantages over the GPU because these formats expose only limited parallelism. The Decompression Engine is our solution to this problem for a range of workloads. The following sections discuss how to leverage nvCOMP to use the DE.

How to use DE and nvCOMP

It’s best for developers to leverage the DE through nvCOMP APIs. Since the DE is only available on select GPUs (as of now, the B200, B300, GB200, and GB300), using nvCOMP enables developers to write portable code that scales and works across GPUs as the DE footprint evolves over time. When the DE is available, nvCOMP will make use of it without changes to user code. If not, nvCOMP will fall back to its accelerated SM-based implementations.
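
As a rough illustration, the sketch below uses nvCOMP's low-level batched interface for LZ4, with the signatures used in recent nvCOMP releases (they may differ in nvCOMP 5.0, and the device-side arrays are assumed to be populated by the caller). The same call is serviced by the DE when the hardware and buffers qualify, and by the SM implementation otherwise:

#include <nvcomp/lz4.h>      // nvCOMP low-level batched LZ4 interface
#include <cuda_runtime.h>

// Sketch: decompress a batch of LZ4 chunks. All pointer/size arrays live in
// device-accessible memory.
void decompress_lz4_batch(const void* const* device_compressed_ptrs,
                          const size_t* device_compressed_bytes,
                          const size_t* device_uncompressed_bytes,
                          size_t* device_actual_uncompressed_bytes,
                          void* const* device_uncompressed_ptrs,
                          nvcompStatus_t* device_statuses,
                          size_t batch_size,
                          size_t max_uncompressed_chunk_size,
                          cudaStream_t stream)
{
    // Query and allocate scratch space for the decompression.
    size_t temp_bytes = 0;
    nvcompBatchedLZ4DecompressGetTempSize(batch_size, max_uncompressed_chunk_size, &temp_bytes);
    void* device_temp_ptr = nullptr;
    cudaMallocAsync(&device_temp_ptr, temp_bytes, stream);

    // nvCOMP dispatches to the hardware Decompression Engine when it is
    // available and the buffers meet its requirements; otherwise the same
    // call runs on the SMs.
    nvcompBatchedLZ4DecompressAsync(device_compressed_ptrs,
                                    device_compressed_bytes,
                                    device_uncompressed_bytes,
                                    device_actual_uncompressed_bytes,
                                    batch_size,
                                    device_temp_ptr, temp_bytes,
                                    device_uncompressed_ptrs,
                                    device_statuses,
                                    stream);

    cudaFreeAsync(device_temp_ptr, stream);
}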

There are a few things you need to do to ensure this behavior on DE-enabled GPUs. nvCOMP generally allows input and output buffers of any type that is accessible to the device, but the DE has stricter requirements. If your buffers do not meet these requirements, nvCOMP will execute the decompression on the SMs instead. See Table 1 for a description of the allowed allocation types and their intended usages.

| Allocation API | Description | Memory location |
| --- | --- | --- |
| cudaMalloc | Standard device-only allocation | Device |
| cudaMallocFromPoolAsync | Easy-to-use pool-based allocations with more control | Host/device |
| cuMemCreate | Low-level control of host/device allocations | Host/device |

Table 1. Allowed allocation types and their intended usages

cudaMalloc allocations can be made as normal for device-to-device decompression. Host-to-device, or even host-to-host, decompression is possible when using cudaMallocFromPoolAsync or cuMemCreate, but care must be taken to set up the allocators properly.

The following sections provide worked examples of how to use these different allocators. Note that in both cases, the only difference from standard use of these APIs is the addition of the cudaMemPoolCreateUsageHwDecompress and CU_MEM_CREATE_USAGE_HW_DECOMPRESS flags. In both examples, these allocations are placed on the first CPU NUMA node.

Using cudaMallocFromPoolAsync 

The code example below shows how to create a pinned host memory pool with the cudaMemPoolCreateUsageHwDecompress flag, enabling allocations compatible with the DE.

cudaMemPoolProps props = {};
props.location.type = cudaMemLocationTypeHostNuma;         // place the pool in host NUMA memory
props.location.id   = 0;                                   // first CPU NUMA node
props.allocType     = cudaMemAllocationTypePinned;
props.usage         = cudaMemPoolCreateUsageHwDecompress;  // mark allocations as DE-compatible

cudaMemPool_t mem_pool;
CUDA_CHECK(cudaMemPoolCreate(&mem_pool, &props));

char* mem_pool_ptr;
CUDA_CHECK(cudaMallocFromPoolAsync(&mem_pool_ptr, 1024, mem_pool, stream));
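
When the allocation and pool are no longer needed, teardown follows the usual stream-ordered pattern, continuing the snippet above:

// Free the allocation in stream order, then destroy the pool once the
// outstanding work has completed.
CUDA_CHECK(cudaFreeAsync(mem_pool_ptr, stream));
CUDA_CHECK(cudaStreamSynchronize(stream));
CUDA_CHECK(cudaMemPoolDestroy(mem_pool));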

Using cuMemCreate 

This example demonstrates how to use the low-level CUDA driver API (cuMemCreate) to allocate pinned host memory with the CU_MEM_CREATE_USAGE_HW_DECOMPRESS flag. It ensures the buffer is compatible with the DE.

CUdeviceptr mem_create_ptr;
CUmemGenericAllocationHandle allocHandle;

CUmemAllocationProp props = {};
props.location.type    = CU_MEM_LOCATION_TYPE_HOST_NUMA;     // pinned host NUMA memory
props.location.id      = 0;                                  // first CPU NUMA node
props.type             = CU_MEM_ALLOCATION_TYPE_PINNED;
props.allocFlags.usage = CU_MEM_CREATE_USAGE_HW_DECOMPRESS;  // mark allocation as DE-compatible

size_t granularity;
CU_CHECK(cuMemGetAllocationGranularity(&granularity, &props, CU_MEM_ALLOC_GRANULARITY_MINIMUM));

// Create the allocation handle
CU_CHECK(cuMemCreate(&allocHandle, granularity, &props, 0));

// Reserve virtual address space
CU_CHECK(cuMemAddressReserve(&mem_create_ptr, granularity, 0, 0, 0));

// Map the physical memory to the virtual address
CU_CHECK(cuMemMap(mem_create_ptr, granularity, 0, allocHandle, 0));
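
Note that the snippet stops at the mapping step. Before the range can actually be used, access generally has to be granted with cuMemSetAccess, and the mapping should be torn down in reverse order when you are done. A sketch, assuming device_id holds the target device ordinal:

// Grant the device read/write access to the mapped range.
CUmemAccessDesc accessDesc = {};
accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
accessDesc.location.id   = device_id;
accessDesc.flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
CU_CHECK(cuMemSetAccess(mem_create_ptr, granularity, &accessDesc, 1));

// Teardown mirrors setup, in reverse order.
CU_CHECK(cuMemUnmap(mem_create_ptr, granularity));
CU_CHECK(cuMemAddressFree(mem_create_ptr, granularity));
CU_CHECK(cuMemRelease(allocHandle));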

Best practices for buffer batching

For best performance, the batch of buffers used for decompression (input/output/sizes) should be pointers offset into the same allocations, as in the example below. If the batch is assembled from many separate allocations, host-side driver launch overhead can be significant.

uint8_t* d_decompressed_buffer;
CUDA_CHECK(cudaMalloc(&d_decompressed_buffer, total_decompressed_size));

// Create a pinned host array for the device decompression pointers
uint8_t** h_d_decompressed_ptrs;
CUDA_CHECK(cudaHostAlloc(&h_d_decompressed_ptrs, actual_num_buffers * sizeof(uint8_t*), cudaHostAllocDefault));

// Fill the pinned host pointer array with offsets into the single device allocation
size_t decompressed_offset = 0;
for (int i = 0; i < actual_num_buffers; ++i) {
    h_d_decompressed_ptrs[i] = d_decompressed_buffer + decompressed_offset;
    decompressed_offset += input_sizes[i];
}
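
The pinned host array above is device-accessible and can often be passed to the batched APIs directly; if you prefer the pointer array resident in device memory, copy it across (a sketch reusing the names from the snippet above):

// Copy the pointer array to the device so batched calls read it from device memory.
uint8_t** d_decompressed_ptrs;
CUDA_CHECK(cudaMalloc(&d_decompressed_ptrs, actual_num_buffers * sizeof(uint8_t*)));
CUDA_CHECK(cudaMemcpyAsync(d_decompressed_ptrs, h_d_decompressed_ptrs,
                           actual_num_buffers * sizeof(uint8_t*),
                           cudaMemcpyHostToDevice, stream));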

Note that due to synchronization requirements associated with the DE, nvCOMP’s asynchronous APIs will synchronize with the calling stream. Generally, nvCOMP will still return before the decompression work finishes, so you will need to synchronize the calling stream again before using the result if decompressing to the host. For device-side access, the decompressed result is available under normal stream ordering.
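
Concretely, the two cases look like the following sketch, where a batched decompression has already been enqueued on stream, and consume_kernel and process_on_cpu are hypothetical consumers:

// Device-side consumers: normal stream ordering is sufficient. A kernel
// enqueued on the same stream after the decompress sees the results.
consume_kernel<<<grid, block, 0, stream>>>(device_output_ptr);

// Host-side consumers: synchronize first, because the nvCOMP call returns
// before the decompression work has actually finished.
CUDA_CHECK(cudaStreamSynchronize(stream));
process_on_cpu(host_output_ptr);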

On B200, if any buffer is larger than 4 MB, nvCOMP will fall back to an SM-based implementation. This limit may change in the future and can be queried with the following code:

int max_supported_size = 0;
CU_CHECK(cuDeviceGetAttribute(&max_supported_size,
    CU_DEVICE_ATTRIBUTE_MEM_DECOMPRESS_MAXIMUM_LENGTH,
    device_id));
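
That attribute also makes it possible to predict up front which path a batch will take, for example:

// Hypothetical check: chunks above the queried limit will be serviced by
// nvCOMP's SM fallback rather than the DE.
bool de_eligible = (chunk_size <= (size_t)max_supported_size);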

How SM performance compares to DE 

The DE provides faster decompression while freeing the SMs for other work. It has dozens of execution units, compared with the thousands of warps available across the SMs; each DE execution unit executes decompression much faster than an SM, though in some workloads SM throughput approaches the DE's when SM resources are fully saturated. Either the SMs or the DE can execute using pinned host data as input, enabling zero-copy decompression.

Figure 1 below compares SM and DE performance on the Silesia benchmark for the LZ4, Deflate, and Snappy algorithms. Note that Snappy is newly optimized in nvCOMP 5.0, and further software optimization opportunities remain for Deflate and LZ4.

The performance measurement is done for 64 KiB and 512 KiB chunk sizes using “small” and “large” datasets. The large dataset is the full Silesia corpus, while the small dataset is the first ~50 MB of Silesia.tar.

Figure 1. Comparing the performance of streaming multiprocessors to the Decompression Engine, as shown in six examples.

Get started

The Decompression Engine in Blackwell makes it much easier to deal with one of the biggest challenges in data-heavy workloads: fast, efficient decompression. By moving this work to dedicated hardware, applications not only see faster results but also free up GPU compute for other tasks. 

With nvCOMP handling the integration automatically, developers can take advantage of these improvements without changing their code, leading to smoother pipelines and better performance.

To get started with these new features, explore the nvCOMP library and its documentation.
