NVIDIA Developer · 00:30, two days ago
NVIDIA CUDA-X Math Libraries Boost AI and Scientific Computing Performance

The NVIDIA CUDA-X math libraries, and cuBLAS in particular, receive a major update in the newly released CUDA Toolkit 13.0 Update 2. The update delivers a significant performance boost for double-precision (FP64) matrix multiplications (matmuls) through floating-point (FP) emulation on the Tensor Cores of GPU architectures such as NVIDIA GB200 NVL72. This lets developers tap Tensor Core performance more easily while maintaining or improving numerical accuracy. The cuBLAS automatic dynamic precision (ADP) framework intelligently evaluates each operation and selects the best way to execute it, balancing performance and accuracy. Real application case studies such as ecTrans and BerkeleyGW show the clear benefits of FP32 and FP64 emulation for accelerated computing and scientific research, and more emulation-based performance gains are expected in the future.

🚀 **A cuBLAS performance leap**: The cuBLAS library in NVIDIA CUDA Toolkit 13.0 Update 2 significantly accelerates matrix multiplication through FP32 and FP64 floating-point emulation on Tensor Cores. For FP64 matmuls it adopts the Ozaki Scheme combined with an automatic dynamic precision (ADP) framework that configures parameters intelligently based on the input data, guaranteeing accuracy no worse than native hardware while delivering a performance gain, which is critical for scientific computing and AI applications.

🎯 **An intelligent balance of accuracy and performance**: The cuBLAS ADP framework is one of the core highlights of this update. It automatically analyzes the computation, determines whether FP emulation can safely be used for higher performance, and dynamically adjusts parameters to preserve accuracy. This lets developers obtain the best combination of performance and accuracy without digging into low-level details; for FP64 in particular, the Ozaki Scheme with dynamic precision adjustment overcomes the limitations of a single fixed configuration.

🔬 **Clear gains in real applications**: Two concrete case studies, ecTrans (a weather-forecasting model) and BerkeleyGW (materials-science computation), demonstrate the practical speedups from FP emulation. ecTrans achieves a 2.4x speedup with FP32 emulation while matching or exceeding native FP32 accuracy, and BerkeleyGW also sees significant gains from emulated FP64 ZGEMM, showing the potential of the new cuBLAS capabilities for accelerating complex scientific computing and research.

💡 **A forward-looking direction**: This release of emulated FP64 matrix multiplication is an important milestone, and more emulation-based performance improvements are expected to follow. By exploiting Blackwell BF16 and INT8 Tensor Cores and continuing to refine the FP emulation algorithms, NVIDIA keeps pushing the boundaries of high-performance computing and AI while giving developers more powerful, easier-to-use tools.

NVIDIA CUDA-X math libraries provide the fundamental numerical building blocks that enable developers to deploy accelerated applications across multiple high-performance domains, including AI and scientific computing.

cuBLAS is a CUDA-X math library that consists of a highly optimized collection of basic linear algebra subroutines for matrix and vector operations that are specifically tuned to get the best possible performance across NVIDIA hardware using familiar and easy-to-use APIs.

The latest cuBLAS update in NVIDIA CUDA Toolkit 13.0 Update 2 introduces new APIs and implementations that significantly boost the performance of double-precision (FP64) matrix multiplications (matmuls). This is achieved through floating-point (FP) emulation on Tensor Cores found in GPU architectures such as NVIDIA GB200 NVL72 and NVIDIA RTX PRO 6000 Blackwell Server Edition. For comprehensive information on GPU compatibility for both FP32 and FP64 emulation, refer to the cuBLAS documentation.

This new emulated FP64 matmul implementation complements the recently released single-precision (FP32) matmul emulation. Developers can fine-tune the required accuracy for FP64 matrix multiplications, but by default cuBLAS maintains accuracy equivalent to or better than native hardware. It automatically assesses whether an operation will perform better using FP emulation (with accuracy preserved) or native hardware and then selects the optimal implementation.
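
To make this concrete, here is a minimal sketch (not taken from the post) of a standard FP64 GEMM call through cuBLAS. The key point is that this call path stays the same: on supported Blackwell GPUs, cuBLAS can serve it either through Tensor Core FP emulation or through native FP64, whichever it judges optimal at equal-or-better accuracy. Whether emulation requires an explicit opt-in in a given toolkit version, and the knobs for tuning it, are described in the cuBLAS documentation. Error checking and matrix initialization are omitted for brevity.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4096;
    const double alpha = 1.0, beta = 0.0;
    const size_t bytes = sizeof(double) * n * n;

    // Device matrices; initialization omitted for brevity.
    double *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Standard column-major FP64 GEMM: C = alpha * A * B + beta * C.
    // Per the behavior described above, cuBLAS decides internally whether
    // FP emulation or native FP64 execution is used for this operation.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```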

This post explains cuBLAS capabilities in CUDA Toolkit 13.0 Update 2, including:

    Seamless access to Tensor Core performance through familiar and straightforward developer APIs
    FP32 emulation with Blackwell BF16 Tensor Cores that provides increased performance over native FP32 matrix multiplication while preserving accuracy
    FP64 emulation with Blackwell INT8 Tensor Cores that provides a safe, automatic performance increase with a fallback to native execution
    FP emulation for increased performance across a variety of software domains and hardware platforms

This is the first release of FP64 matmul emulation with more advancements to follow in upcoming releases.

Floating-point emulation in practice

The cuBLAS library exposes two flavors of matmul emulation: the BF16x9 algorithm for FP32 and the Ozaki Scheme for FP64. The BF16x9 algorithm provides a static decomposition that can performantly and safely emulate all normal and subnormal FP32 values using Blackwell BF16 Tensor Cores, as illustrated by the scalar sketch below.
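
As a concrete illustration of the BF16x9 idea (a scalar sketch under simplifying assumptions, not cuBLAS kernel code), one FP32 value can be split into three BF16 pieces whose sum reconstructs it, so a single FP32 product expands into nine BF16 products accumulated in FP32. A bit-truncation helper stands in for proper BF16 rounding to keep the sketch self-contained.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Truncate an FP32 value to BF16 precision by keeping the top 16 bits of its
// bit pattern. Real hardware uses proper rounding; truncation keeps the
// sketch simple.
static float to_bf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFF0000u;
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

int main() {
    float x = 1.2345678f, y = 9.8765432f;

    // Split each operand into three BF16 pieces: x ~= x_hi + x_mid + x_lo.
    float x_hi = to_bf16(x), x_mid = to_bf16(x - x_hi), x_lo = to_bf16(x - x_hi - x_mid);
    float y_hi = to_bf16(y), y_mid = to_bf16(y - y_hi), y_lo = to_bf16(y - y_hi - y_mid);

    // The nine partial products a BF16 Tensor Core pass would compute,
    // accumulated in FP32.
    float xs[3] = {x_hi, x_mid, x_lo}, ys[3] = {y_hi, y_mid, y_lo};
    float emulated = 0.0f;
    for (float xi : xs)
        for (float yj : ys)
            emulated += xi * yj;

    std::printf("native  : %.9g\n", x * y);
    std::printf("emulated: %.9g  (abs err %.3g)\n",
                emulated, std::fabs(emulated - x * y));
    return 0;
}
```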

A common challenge of emulating FP64 with the Ozaki Scheme, however, is that the numerics of the problem necessitate different representations. In other words, a single configuration cannot performantly and accurately emulate all FP64 values. Specifically, because the Ozaki Scheme uses a fixed-point representation for the operands after their exponents are aligned, the number of “mantissa bits” required is data dependent and must be greater than or equal to the 53 bits in the IEEE 754 FP64 representation to deliver the same or better accuracy.
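
The following scalar sketch illustrates the general fixed-point idea (a simplified illustration of an Ozaki-style split, not the cuBLAS implementation): align each operand to a power-of-two scale, truncate it to a fixed-point integer, cut that integer into 8-bit slices, compute the slice products exactly in integer arithmetic, and recombine them with power-of-two weights. For real matrix products, how many slices (mantissa bits) are needed depends on the data and the accumulation length, which is what the ADP framework described below determines automatically.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    const int slice_bits = 8;   // slice width matching INT8 Tensor Cores
    const int num_slices = 7;   // 7 x 8 = 56 bits covers the 53-bit mantissa

    double x = 3.141592653589793, y = 2.718281828459045;

    // v = m * 2^e with m in [0.5, 1); truncate m to a 56-bit fixed-point
    // integer and cut it into 8-bit slices (least significant slice first).
    auto to_slices = [&](double v, int& e, std::int64_t* s) {
        double m = std::frexp(v, &e);
        std::int64_t fixed = (std::int64_t)std::ldexp(m, slice_bits * num_slices);
        for (int i = 0; i < num_slices; ++i) {
            s[i] = fixed & ((1 << slice_bits) - 1);
            fixed >>= slice_bits;
        }
    };

    int ex, ey;
    std::int64_t xs[num_slices], ys[num_slices];
    to_slices(x, ex, xs);
    to_slices(y, ey, ys);

    // Every pairwise slice product is exact in integer arithmetic; the
    // power-of-two weights restore each product to its true binary scale.
    double acc = 0.0;
    for (int i = 0; i < num_slices; ++i)
        for (int j = 0; j < num_slices; ++j)
            acc += std::ldexp((double)(xs[i] * ys[j]), slice_bits * (i + j));
    double emulated = std::ldexp(acc, ex + ey - 2 * slice_bits * num_slices);

    std::printf("native  : %.17g\n", x * y);
    std::printf("emulated: %.17g  (rel err %.2g)\n",
                emulated, std::fabs(emulated - x * y) / (x * y));
    return 0;
}
```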

To solve this problem, the cuBLAS library includes an automatic dynamic precision (ADP) framework which seamlessly analyzes inputs to determine if emulation can be safely leveraged for increased performance. If so, the emulation parameters are automatically configured to enable accuracy equal to or better than the native FP64 matmul.

Application results: ecTrans

When weather forecasting or climate modeling applications simulate the complex physics involved across the Earth’s atmosphere, oceans, and other systems, a grid is needed to discretize the domain and perform the calculations. The open source ecTrans library relies on linear algebra operations to perform the grid-based transformations that are used for the weather predictions of the Integrated Forecasting System (IFS).

As shown in Figure 1, using NVIDIA Blackwell Tensor Cores for FP32 emulation significantly improves performance in ecTrans by providing a 2.4x speedup to the matrix product computations. 

Figure 1. Performance is improved by using Blackwell BF16 Tensor Cores for FP32 emulation to reduce the amount of time spent computing matrix products in ecTrans

In addition to the increased performance, the numerical accuracy achieved with FP emulation is equivalent or superior to the results obtained with native FP32. To validate this, 1,000 consecutive forward and backward applications of the spectral transform to real data fields from an actual simulation were performed.

During this process, the error distributions of the velocities (U and V) and temperature (T) obtained with BF16x9 FP emulation were tracked and compared to the results obtained with standard FP32 precision (the operational precision used at the European Centre for Medium-Range Weather Forecasts for daily forecasts).

Figure 2. Repeated forward and backward iterations result in error distributions that show the numerical accuracy of SGEMMs using BF16x9 FP emulation to be as good as or better than native FP32 in ecTrans

The probability density functions of the absolute errors are shown in Figure 2 across FP32, TF32, and BF16x9 FP emulation. These plots correspond to the likelihood of encountering an error if velocities and temperatures are randomly sampled. The closer the curves are to a delta function centered at 0, the more accurate the underlying implementation. 

The results for TF32 are not shown on the velocity plots due to their large error terms. Zooming out, large errors in the velocities and temperatures would become visible, which demonstrates the sensitivity of weather modeling to precision. BF16x9 FP emulation, however, not only keeps errors within acceptable ranges but matches or exceeds the accuracy of native FP32, while also exceeding its performance.
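
As an illustration of this kind of validation (a generic sketch, not the ecTrans harness), the absolute errors between an emulated field and a reference field can be tabulated into a log-scaled histogram that approximates the error distributions plotted in Figure 2.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Bin absolute errors on a logarithmic axis spanning 10^lo to 10^hi.
std::vector<int> abs_error_histogram(const std::vector<float>& emulated,
                                     const std::vector<float>& reference,
                                     int num_bins, double lo, double hi) {
    std::vector<int> bins(num_bins, 0);
    for (size_t i = 0; i < reference.size(); ++i) {
        double err = std::fabs((double)emulated[i] - (double)reference[i]);
        if (err <= 0.0) { bins.front()++; continue; }  // exact matches fall in the lowest bin
        double t = (std::log10(err) - lo) / (hi - lo); // position along the log axis
        int b = (int)(t * num_bins);
        if (b < 0) b = 0;
        if (b >= num_bins) b = num_bins - 1;
        bins[b]++;
    }
    return bins;
}

int main() {
    // Toy data standing in for a reference field and its emulated counterpart.
    std::vector<float> ref = {1.0f, 2.5f, -3.25f, 0.75f};
    std::vector<float> emu = {1.0f, 2.5000002f, -3.2499998f, 0.75f};
    std::vector<int> h = abs_error_histogram(emu, ref, 16, -10.0, 0.0);
    for (size_t b = 0; b < h.size(); ++b)
        std::printf("bin %zu: %d\n", b, h[b]);
    return 0;
}
```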

Application results: BerkeleyGW

The BerkeleyGW code is used by researchers to study physical properties of materials that emerge from how electrons change energy states. It is a massively parallel code that has been run at full scale on leadership-class supercomputers. Using GPUs with BerkeleyGW can deliver an 86x speedup over the CPU-only implementation, and the code can be accelerated even further with FP emulation.

Using emulated complex FP64 matmuls (ZGEMM) in the CHISUM routine of the BerkeleyGW Epsilon module allows for some flexibility in determining the optimal balance between accuracy and performance. By default, cuBLAS uses its ADP framework to determine the parameters that will guarantee results as accurate as using native FP64. This is done automatically for users and results in the performance gains shown in Figure 3.

Figure 3. Performance improvements in BerkeleyGW Epsilon module using Ozaki Scheme-based FP64 emulation for the ZGEMMs in the CHISUM calculation on Blackwell B200 compared to native FP64

However, the cuBLAS API enables the user to further fine-tune the performance by using fewer bits for the FP64 emulated operations. For BerkeleyGW, two cases were measured: FP emulation with the default ADP setting and with 55 mantissa bits set manually. Both resulted in accuracy well within widely accepted tolerances (1e-10) relative to the reference values, with the 55-mantissa-bit case providing even more acceleration.

The performance difference comes from ADP determining that more than 55 mantissa bits are required; however, the reduced precision with the manually set 55 mantissa bits does not have an impact on application-level accuracy for these tests. If more performance is desired, cuBLAS APIs enable you to adjust the precision used during emulation and explore if the resulting accuracy meets application needs.
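
For example, the comparison against reference values described here can be as simple as the following generic sketch (not BerkeleyGW's actual validation code), using the 1e-10 tolerance mentioned above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Compare an emulated result against a native FP64 reference and verify the
// largest relative deviation stays within a tolerance.
bool within_tolerance(const std::vector<double>& reference,
                      const std::vector<double>& emulated,
                      double tol) {
    double max_abs = 0.0, max_rel = 0.0;
    for (size_t i = 0; i < reference.size(); ++i) {
        double abs_err = std::fabs(emulated[i] - reference[i]);
        max_abs = std::max(max_abs, abs_err);
        if (reference[i] != 0.0)
            max_rel = std::max(max_rel, abs_err / std::fabs(reference[i]));
    }
    std::printf("max abs err %.3e, max rel err %.3e\n", max_abs, max_rel);
    return max_rel <= tol;
}

int main() {
    // Toy stand-ins for a reference result and an emulated result.
    std::vector<double> ref = {1.0, -2.0, 3.5};
    std::vector<double> emu = {1.0 + 1e-12, -2.0, 3.5 - 2e-12};
    std::printf("within 1e-10 tolerance: %s\n",
                within_tolerance(ref, emu, 1e-10) ? "yes" : "no");
    return 0;
}
```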

Application results: Quantum Espresso

The open source Quantum Espresso (QE) collection of applications is used worldwide for materials science calculations based on density functional theory (DFT). The core of these applications is highly optimized both for scale-out distributed computation and for fine-grained parallelism within a node.

QE depends on efficient double-precision GEMMs to apply operators during each step of the fundamental iteration cycle for determining ground state energies of atoms and materials. This double-precision GEMM usage is similar to many other DFT-based applications, and so the performance improvements for Quantum Espresso realized from FP emulation are expected to translate to many other DFT applications as well.

For the results shown in Figure 4, the Ausurf benchmark dataset was used to measure both the quality of the numerical results and the performance of QE with FP emulation enabled in the cuBLAS library on an RTX PRO 6000 Blackwell Server Edition GPU.

Figure 4. Performance of the Ausurf benchmark on RTX PRO 6000 Blackwell Server Edition across native FP64 and several configurations of emulated FP64

Figure 4 shows that FP emulation with ADP provides a significant 1.5x end-to-end speedup, and with further tuning to 39 mantissa bits, a nearly 3x end-to-end speedup is achieved. For all configurations, the accuracy results are indistinguishable from one another until emulated FP64 with 39 mantissa bits is used. This configuration produces application output values that are consistent up to 12 (base-10) significant digits.

The performance difference between ADP and 55 mantissa bits is due to the ADP framework determining that more than 55 mantissa bits are required for IEEE 754 FP64 level accuracy; however, in practice, using fewer mantissa bits does not impact the measured application-level accuracy.

Benchmarking results: Heat maps

In addition to end-to-end application performance improvements, it is important to understand the applicability range of FP emulation when analyzing how it can improve your own application’s performance. The three heat maps shown in Figures 5-7 demonstrate the performance improvements from using emulated matmuls across different matrix shapes on a GB200 NVL72 GPU for FP32 and FP64 and on an RTX PRO 6000 Blackwell Server Edition for FP64.

Figure 6. Performance improvements on a GB200 NVL72 GPU across many GEMM shapes of FP64 emulation with ADP versus FP64
Figure 7. Performance improvements on RTX PRO 6000 Blackwell Server Edition across many GEMM shapes of FP64 emulation with ADP versus FP64

All three heat maps demonstrate substantial performance gains on moderate and large problem shapes. Additionally, in Figures 6 and 7 the ADP framework uses 55 mantissa bits, and when the problems are too small to benefit from emulation there is no performance penalty for attempting it, because cuBLAS heuristics fall back to the native FP64 algorithms. We expect further improvements to performance and to the applicability region in future cuBLAS releases.

What’s next for FP emulation

While FP emulation is already accelerating real applications, NVIDIA is continuing to advance and improve this technology across several key impact areas. Additional key BLAS level-3 and LAPACK routines within the CUDA-X math libraries will be accelerated through both FP32 and FP64 emulation. The team will continue to improve FP64 emulation with optimizations to the ADP framework and GEMM kernels, reduced workspace memory requirements, and the Ozaki-II Scheme.

Using the strategies discussed in this post, you can take advantage of Tensor Core performance for algorithms that use matrix multiplication without changing your code or requiring tedious performance analysis. cuBLAS will automatically choose the best strategy, delivering high performance while preserving the desired level of accuracy.

To start using FP emulation and exploring its benefits in your own applications, download CUDA Toolkit 13.0 Update 2.

