Nvidia Developer, November 11, 03:47
AI-Powered Weather Forecasting: Significant Acceleration of the CorrDiff Model

 

This article describes how the NVIDIA Earth-2 platform accelerates AI weather and climate prediction models with a GPU-optimized software stack. In particular, CorrDiff, a generative AI downscaling model, sidesteps the computational bottlenecks of traditional numerical methods to deliver fine-grained, high-resolution weather predictions. Performance optimizations for CorrDiff training and inference, including AMP-BF16, multi-iteration patching, Apex GroupNorm, kernel fusion, and torch.compile, substantially improve computational efficiency. With these optimizations, planetary-scale training completes in under 3,000 GPU-hours, country-scale training drops to on the order of 100 GPU-hours, and planetary-scale inference finishes in GPU-minutes. The article also presents a Speed-of-Light analysis of post-optimization GPU utilization and provides guidance on obtaining and using the optimized code.

🚀 **AI-driven breakthrough in weather downscaling:** CorrDiff, a generative AI model, uses an innovative patch-based multidiffusion approach to overcome the prohibitive computational cost that traditional dynamical downscaling faces on large, high-resolution weather datasets. It generates fine-grained weather predictions with far greater efficiency and scalability, supporting critical sectors such as agriculture, energy, transportation, and disaster preparedness.

⚡ **Performance optimizations deliver generational speedups:** A series of advanced optimizations, including automatic mixed precision (AMP-BF16), multi-iteration patching to amortize regression cost, NVIDIA Apex GroupNorm to eliminate data transposes, and kernel fusion via torch.compile, dramatically accelerates CorrDiff training and inference. On NVIDIA Blackwell GPUs, training runs 53.86x faster than the baseline, country-scale training drops to on the order of 100 GPU-hours, and planetary-scale inference completes in GPU-minutes.

💡 **Maximizing GPU efficiency and accessibility:** Through a Speed-of-Light analysis and kernel-utilization visualizations, the article shows that the optimized CorrDiff workflow uses GPU resources efficiently, reaching 63%-67% of the hardware performance ceiling. These optimizations sharply lower the computational barrier for AI weather models, making training and customization of kilometer-scale AI weather forecasting practical and accelerating the adoption of AI in meteorology.

🌍 **Multi-domain applications and broad adoption:** CorrDiff has been widely adopted worldwide, including by The Weather Company (TWC) to support agriculture, energy, and aviation; by G42 to improve smog and dust storm prediction in the Middle East; and by Tomorrow.io to enhance storm-scale predictions such as fire weather forecasts and wind gust forecasts that disrupt railway operations, demonstrating its versatility and practical value wherever fine-scale weather information is essential.

As AI weather and climate prediction models rapidly gain adoption, the NVIDIA Earth-2 platform provides libraries and tools for accelerating solutions using a GPU-optimized software stack. Downscaling, which is the task of refining coarse-resolution (25km scale) weather data, enables national meteorological service (NMS) agencies to deliver high-resolution predictions for agriculture, energy, transportation, and disaster preparedness at spatial resolutions fine enough for actionable decision-making and planning. 

Traditional dynamical downscaling is prohibitively expensive, especially for large ensembles at high resolution and over extensive spatial domains. CorrDiff, a generative AI downscaling model, sidesteps the computational bottlenecks of traditional numerical methods, achieves state-of-the-art results, and uses a patch-based multidiffusion approach to scale to continental and global domains. This AI-based solution unlocks significant gains in efficiency and scalability compared to traditional numerical methods, while greatly reducing computational costs.

CorrDiff has gained global adoption for various use cases, demonstrating its versatility and impact across domains where fine-scale weather information is essential:

    The Weather Company (TWC) for supporting the agriculture, energy, and aviation industries.
    G42 for improving smog and dust storm predictions in the Middle East.
    Tomorrow.io for enhancing a range of storm-scale predictions, including fire weather forecasts and wind gust forecasts that disrupt railway operations.

In this blog post, we show the performance optimizations and enhancements for CorrDiff training and inference that were incorporated into two tools in the Earth-2 stack, NVIDIA PhysicsNeMo and NVIDIA Earth2Studio. Achieving over 50x speedup on training and inference baselines, these optimizations enable:

    Scaling patch-based training of the entire planet in under 3,000 GPU-hours.
    Lowering most country-scale trainings to O(100) GPU-hours.
    Training over the contiguous United States (CONUS) in under 1,000 GPU-hours.
    Fine-tuning and bespoke training that democratizes km-scale AI weather.
    Country-scale inference in GPU-seconds, planetary-scale inference in GPU-minutes.
    Generating large ensembles affordably for high-resolution probabilistic forecasting.
    Interactive exploration of kilometer-scale data.

CorrDiff: Training and inference 

Figure 1: CorrDiff training and sampling workflow

Figure 1 illustrates the training and sampling workflow of CorrDiff for generative downscaling. During diffusion training, a pretrained regression model is used to generate the conditional mean, which serves as input for training the diffusion model. For background and details on CorrDiff, refer to the CorrDiff publication, the PhysicsNeMo docs, and the source code.
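To make the two-stage structure concrete, here is a minimal, self-contained sketch of the idea in PyTorch. The stand-in modules and the residual-based loss are illustrative assumptions, not the PhysicsNeMo implementation; the point is that the pretrained regression model supplies the conditional mean while the diffusion model is trained on the remaining fine-scale detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in modules for illustration only; the real CorrDiff networks are
# UNet-style models defined in PhysicsNeMo.
regression_model = nn.Conv2d(12, 4, kernel_size=3, padding=1)        # coarse inputs -> conditional mean
diffusion_denoiser = nn.Conv2d(4 + 12, 4, kernel_size=3, padding=1)  # denoises the residual

def diffusion_training_step(coarse_input, fine_target, sigma):
    """One conceptual step: regress the conditional mean, then diffuse the residual."""
    with torch.no_grad():
        cond_mean = regression_model(coarse_input)   # stage 1: pretrained, frozen
    residual = fine_target - cond_mean               # stage 2 learns the residual detail
    noised = residual + torch.randn_like(residual) * sigma
    denoised = diffusion_denoiser(torch.cat([noised, coarse_input], dim=1))
    return F.mse_loss(denoised, residual)

# Toy shapes: 12 coarse input channels, 4 output channels, one 448x448 patch.
loss = diffusion_training_step(torch.randn(1, 12, 448, 448),
                               torch.randn(1, 4, 448, 448),
                               sigma=0.5)
```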

Why optimize CorrDiff? 

Diffusion models are resource-intensive because they rely on iterative sampling, with each denoising step involving multiple neural network computations. This makes inference time-consuming and costly. Training is also more expensive because the denoiser has to be trained for the full range of noise levels. Optimizing their performance requires:

    Streamlining core operations (e.g., fusing kernels, using mixed precision, using NVIDIA CUDA graphs, etc.).
    Improving the sampling process by reducing the number of denoising steps and using optimal time integration schemes.

CorrDiff uses the EDM architecture, where several computationally expensive operations, such as group normalization, activation functions, and convolutions, can be optimized using advanced packages and kernel fusion.

CorrDiff also uses a two-stage pipeline (regression and correction), offering opportunities to amortize cost across multiple diffusion steps by caching regression outputs, minimizing redundant compute.

Accelerated CorrDiff

In the following figures, we summarize the various optimizations that result in over a 50x speedup on both training and inference costs over the CONUS domain. Figures 2 and 3 summarize the cumulative speedup factors achieved over the baseline with each successive optimization. Details of each optimization are provided in subsequent sections. 

Figure 3. Patch-based CorrDiff Inference Speedup per sample, for CONUS-scale ~2M pixels (1056×1792). Green bars plot the speedup relative to the fp32 baseline performance (left y-axis) and grey solid line plots the sampling runtime per batch element in log scale (right y-axis)

Optimized CorrDiff: How it’s achieved

The baseline performance of CorrDiff on NVIDIA H100 GPUs with FP32 precision, batch size = 1, patch size = 1 (in absolute time) was as follows:

    Regression forward: 1204 ms
        Domain: CONUS, 1056 × 1792 pixels
        Input channels: [“u500”, “v500”, “z500”, “t500”, “u850”, “v850”, “z850”, “t850”, “u10m”, “v10m”, “t2m”, “tcwv”] at 25 km resolution
        Output channels: [“refc”, “2t”, “10u”, “10v”] at 2 km resolution
    Diffusion forward: 155 ms
        Domain: spatial patch of size 448 × 448 pixels
        Input channels: [“u500”, “v500”, “z500”, “t500”, “u850”, “v850”, “z850”, “t850”, “u10m”, “v10m”, “t2m”, “tcwv”] at 25 km resolution
        Output channels: [“refc”, “2t”, “10u”, “10v”] at 2 km resolution
    Diffusion backward: 219 ms

While effective, this baseline was limited by expensive regression model forward passes and inefficient data transposes.

Figure 4. Training performance profile: NVIDIA Nsight Systems profile showing runtime distribution of baseline patch-based CorrDiff training stages, with regression forward dominating total iteration time

Key CorrDiff training optimizations 

To achieve substantial acceleration in CorrDiff training, culminating in a 53.86x speedup on NVIDIA B200 and 25.51x on H100, we introduced the series of performance optimizations outlined below.

Optimization 1: Enable AMP-BF16 for training
The original training recipe uses FP32 precision. Here, we enabled Automatic Mixed Precision (AMP) with BF16 for training to reduce memory usage and improve throughput without compromising numerical stability, leading to a 2.03x speedup over baseline.
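The snippet below is a minimal sketch of how AMP with BF16 is typically enabled in a PyTorch training loop; `model`, `optimizer`, and `loss_fn` are generic placeholders rather than the actual CorrDiff training objects.

```python
import torch

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass and loss in BF16. BF16 keeps FP32's exponent range,
    # so no gradient scaler is needed (unlike FP16 autocast).
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(inputs), targets)
    loss.backward()   # backward runs outside autocast; parameters stay in FP32
    optimizer.step()
    return loss.detach()
```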

Optimization 2: Amortizing regression cost using multi-iteration patching
In the original patch-based training workflow, each 448×448 patch sample for diffusion model training required inference of the regression model over the full 1056×1792 CONUS domain. This caused diffusion model training throughput to be bottlenecked by regression model inference.

We improved efficiency by caching regression outputs and reusing them across 16 diffusion patches per timestamp. This provided broader spatial coverage while spreading regression costs more effectively, yielding a 12.33× speedup over the baseline.
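The following sketch illustrates the amortization pattern, assuming hypothetical `regression_model` and `diffusion_step` callables; the actual patching logic lives in the PhysicsNeMo CorrDiff trainer.

```python
import torch

def train_patches_for_timestamp(regression_model, diffusion_step,
                                coarse_input, fine_target,
                                patch_size=448, patches_per_timestamp=16):
    """Run the full-domain regression once, then reuse it for many diffusion patches."""
    with torch.no_grad():
        # One full-domain (e.g., 1056x1792) regression forward pass per timestamp...
        cond_mean = regression_model(coarse_input)

    _, _, height, width = fine_target.shape
    for _ in range(patches_per_timestamp):
        # ...amortized across several randomly placed 448x448 diffusion patches.
        top = torch.randint(0, height - patch_size + 1, (1,)).item()
        left = torch.randint(0, width - patch_size + 1, (1,)).item()
        window = (slice(None), slice(None),
                  slice(top, top + patch_size), slice(left, left + patch_size))
        diffusion_step(cond_mean[window], fine_target[window])
```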

Optimization 3: Eliminating data transposes with Apex GroupNorm
The training pipeline initially used the default NCHW memory layout, which triggers costly implicit memory transposes before/after convolutions. Switching the model and input tensors to NHWC (channels-last) format aligns them with cuDNN’s preferred layout. However, PyTorch GroupNorm ops do not support the channels-last format. To prevent transposes and keep data in channels-last format for more efficient normalization kernels, we replaced PyTorch GroupNorm with NVIDIA Apex GroupNorm. This eliminated transpose overhead and yielded a 16.71× speedup over the baseline.
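A minimal sketch of the layout change, assuming Apex is installed with its `apex.contrib.group_norm` extension (the module PhysicsNeMo uses); the block below is illustrative rather than the actual CorrDiff UNet.

```python
import torch
import torch.nn as nn
from apex.contrib.group_norm import GroupNorm as ApexGroupNorm  # channels-last GroupNorm

# Illustrative block only; in CorrDiff these layers live inside the EDM UNet.
block = nn.Sequential(
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    ApexGroupNorm(num_groups=32, num_channels=128),  # drop-in replacement for nn.GroupNorm
    nn.SiLU(),
).cuda().to(memory_format=torch.channels_last)       # keep weights in NHWC layout

x = torch.randn(1, 128, 448, 448, device="cuda").to(memory_format=torch.channels_last)
y = block(x)  # convolutions and normalization stay channels-last, avoiding transposes
```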

Optimization 4: Fusing GroupNorm with SiLU
By fusing GroupNorm and SiLU activation into a single kernel using Apex, we reduced kernel launches and the number of global memory accesses. This increased GPU utilization and delivered a 17.15× speedup over the baseline.
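If, as in PhysicsNeMo, the Apex layer exposes an activation argument, the fusion amounts to moving the SiLU into the normalization layer instead of keeping it as a separate module; the `act="silu"` keyword below is an assumption based on that usage.

```python
from apex.contrib.group_norm import GroupNorm as ApexGroupNorm

# Fused GroupNorm + SiLU: the activation runs inside the same kernel as the
# normalization, removing one kernel launch and one round trip to global memory.
norm_act = ApexGroupNorm(num_groups=32, num_channels=128, act="silu")
```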

Optimization 5: Extended channel dimension support in Apex GroupNorm
Some CorrDiff layers use channel dimensions unsupported by Apex. We extended support for these channel dimensions, unlocking fusion for all layers. This improved performance to 19.74× speedup over the baseline.

Optimization 6: Kernel fusion through graph compilation
We used torch.compile to fuse the remaining elementwise operations (e.g., addition, multiplication, sqrt, exp). This improved scheduling, reduced global memory accesses, and cut Python overhead, delivering a 25.51× speedup over the baseline.
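A toy example of the pattern: wrapping the denoiser in `torch.compile` lets TorchInductor fuse the surrounding elementwise math into fewer kernels. The module below is a stand-in, not the EDM denoiser.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in denoiser with EDM-like elementwise preconditioning around a conv."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(4, 4, kernel_size=3, padding=1)

    def forward(self, x, sigma):
        c_in = 1.0 / torch.sqrt(sigma**2 + 1.0)    # elementwise ops Inductor can fuse
        c_out = sigma / torch.sqrt(sigma**2 + 1.0)
        return x + c_out * self.conv(x * c_in)

denoiser = torch.compile(ToyDenoiser().cuda())     # fuse elementwise ops, cut Python overhead
out = denoiser(torch.randn(1, 4, 448, 448, device="cuda"),
               torch.tensor(0.5, device="cuda"))
```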

Optimization 7: Apex GroupNorm V2 on NVIDIA Blackwell
Using Apex GroupNorm V2, optimized for NVIDIA Blackwell GPUs, yielded a 53.86× speedup over the baseline on B200 and a 2.1× speedup over the H100-optimized workflow.

Figure 5: Optimized patch-based CorrDiff training profile

Training throughput 

We compare the training throughput of baseline CorrDiff on NVIDIA Hopper against optimized versions on Hopper and Blackwell in Table 1. The optimized implementation achieves improvements in efficiency across both architectures, with Blackwell showing the most significant gains.

Note: Regression refers to the regression forward pass. Diffusion refers to the diffusion forward pass. Total includes the combined cost of (regression forward + diffusion forward + diffusion backward).

| GPU Version | Precision | Regression (ms/patch) | Diffusion (ms/patch) | Total runtime (ms/patch) | Throughput (patch/s) |
|---|---|---|---|---|---|
| H100 Baseline | FP32 | 1204.0 | 374.0 | 1578.0 | 0.63 |
| H100 Optimized | BF16 | 10.609 | 51.25 | 61.859 | 16.2 |
| B200 Optimized | BF16 | 4.734 | 24.56 | 29.297 | 34.1 |
Table 1. CorrDiff training throughput comparison

Speed-of-Light analysis 

To evaluate how close our optimized CorrDiff workflow comes to the hardware performance ceiling, we conducted a Speed-of-Light (SOL) analysis on H100 and B200 GPUs. This provides an upper-bound estimate of achievable performance by assessing how effectively GPU resources are being used.

Steps to estimate SOL:

    1. Identify low-utilization kernels: We focus on kernels with both DRAM read/write bandwidth < 60% and Tensor Core utilization < 60%. Such kernels are neither memory-bound nor compute-bound, making them likely performance bottlenecks.
    2. Estimate per-kernel potential: For each low-utilization kernel, we estimate the potential speedup under ideal conditions, namely full DRAM bandwidth usage or full Tensor Core activity.
    3. Aggregate overall speedup: We then compute the hypothetical end-to-end speedup if each kernel were optimized to its ideal performance.
    4. Compute SOL efficiency: Finally, we estimate the fraction of theoretical maximum SOL as the fraction of peak performance achievable if the top 10 runtime-dominant kernels were individually boosted to their theoretical maximum. A minimal numeric sketch of this aggregation follows below.
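The arithmetic behind steps 2-4 can be sketched as follows; the kernel statistics are hypothetical numbers, not measurements from the CorrDiff profiles.

```python
# Hypothetical per-kernel stats: (share of total runtime, DRAM bandwidth util, Tensor Core util).
kernels = [
    (0.30, 0.85, 0.10),  # memory-bound: already near the DRAM ceiling
    (0.25, 0.10, 0.80),  # compute-bound: already near the Tensor Core ceiling
    (0.45, 0.30, 0.20),  # low utilization on both axes: optimization candidate
]

ideal_total = 0.0
for share, dram_util, tc_util in kernels:
    if dram_util < 0.60 and tc_util < 0.60:
        # Assume the kernel could be driven to full utilization of its better resource,
        # shrinking its runtime share by that utilization factor.
        ideal_total += share * max(dram_util, tc_util)
    else:
        ideal_total += share  # already memory- or compute-bound; leave unchanged

hypothetical_speedup = 1.0 / ideal_total
sol_fraction = ideal_total  # current performance as a fraction of the estimated SOL
print(f"hypothetical speedup: {hypothetical_speedup:.2f}x, SOL fraction: {sol_fraction:.0%}")
```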

Using this framework, our optimized CorrDiff workflow achieves 63% of the estimated SOL on H100 and 67% on B200. This indicates strong GPU utilization while still leaving headroom for future kernel-level improvements.

To further assess efficiency, we visualize kernel performance as shown in Figures 6 and 7. Each dot represents a kernel, plotted by NVIDIA Tensor Core utilization (x-axis) and combined DRAM read/write bandwidth utilization (y-axis). The dot size reflects its share of total runtime, highlighting performance-critical operations. 

Kernels near the top or right edges are generally well-optimized, as they fully exploit compute or memory resources. In contrast, kernels in the bottom-left quadrant underutilize both and represent the best opportunities for further optimization. This visualization provides a clear picture of the runtime distribution and helps identify where GPU efficiency can be improved.
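A quick way to reproduce this kind of view from profiler output is a scatter plot like the sketch below; the per-kernel numbers here are placeholders, not the profiled CorrDiff kernels.

```python
import matplotlib.pyplot as plt

# Placeholder per-kernel stats: (Tensor Core util, DRAM read+write util, share of runtime).
kernels = [
    (0.05, 0.10, 0.30),
    (0.75, 0.15, 0.20),
    (0.10, 0.85, 0.25),
    (0.20, 0.25, 0.25),
]

tc_util = [k[0] for k in kernels]
dram_util = [k[1] for k in kernels]
sizes = [5000 * k[2] for k in kernels]   # dot area proportional to runtime share

plt.scatter(tc_util, dram_util, s=sizes, alpha=0.6)
plt.xlabel("Tensor Core utilization")
plt.ylabel("DRAM read+write bandwidth utilization")
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.title("Per-kernel utilization (dot size = share of runtime)")
plt.show()
```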

Figure 6: Baseline FP32 patch-based CorrDiff kernel utilization on B200

Figure 6 shows the distribution of kernels in terms of Tensor Core utilization and DRAM bandwidth utilization for the baseline CorrDiff implementation. In an unoptimized workflow with FP32 precision, >95% of time is spent in low-utilization kernels where both DRAM utilization (read + write) and tensor core utilization are below 60%.

The majority of runtime-dominant kernels cluster near the origin, showing very low DRAM and Tensor Core utilization. Only a small number of kernels lie near the upper or right boundaries, where kernels become clearly memory-bound or compute-bound. The unoptimized US CorrDiff workflow reached only 1.23% of the estimated SOL on B200.

Figure 7: Optimized BF16 patch-based CorrDiff kernel utilization on B200

Figure 7 shows the distribution of kernels in the optimized implementation in terms of Tensor Core utilization and DRAM bandwidth utilization. In the optimized workflow with AMP-BF16 training, a higher proportion of kernels sit near the top-left or bottom-right edges, indicating good performance and GPU utilization. Optimized CorrDiff now reaches 67% of the estimated SOL on B200. Despite the overall improvements in the optimized workflow, some kernels still have the potential to be accelerated further.

CorrDiff inference optimizations 

Many of the training optimizations also apply to inference. In addition, we introduced several inference-specific optimizations to maximize performance.

Optimized multi-diffusion
CorrDiff uses a patch-based multi-diffusion approach, where overlapping spatial patches are denoised and then aggregated. Initially, 27.1% of the total runtime was spent in im2col folding/unfolding operations. Precomputing overlap counts for each pixel and using torch.compile() to accelerate the remaining folding/unfolding steps eliminates the im2col bottleneck entirely, resulting in a speedup of 7.86x.
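The overlap-count idea can be sketched with PyTorch's fold/unfold primitives, as below; this is a simplified illustration under assumed patch and stride sizes, not the Earth2Studio implementation.

```python
import torch
import torch.nn.functional as F

def make_overlap_counts(domain_hw, patch=448, stride=384, device="cuda"):
    """Precompute, once per domain, how many patches cover each output pixel."""
    H, W = domain_hw
    ones = torch.ones(1, 1, H, W, device=device)
    cols = F.unfold(ones, kernel_size=patch, stride=stride)
    counts = F.fold(cols, output_size=(H, W), kernel_size=patch, stride=stride)
    # Clamp avoids divide-by-zero at border pixels not covered by a patch in this toy setup.
    return counts.clamp_(min=1)

@torch.compile  # fuse the fold/normalize step instead of paying im2col overhead per sample
def aggregate_patches(patch_outputs, counts, domain_hw, patch=448, stride=384):
    """Fold overlapping denoised patches back onto the full domain and normalize."""
    summed = F.fold(patch_outputs, output_size=domain_hw, kernel_size=patch, stride=stride)
    return summed / counts
```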

Deterministic Euler sampler (12 steps) 
The original stochastic sampler used 18 denoising steps with the Heun solver and second-order correction. By enabling a deterministic sampler using the Euler solver (with no second-order correction), we reduced the number of denoising steps to 12 without impacting output quality. This change delivered an additional ~2.8× speedup on both Hopper and Blackwell. The ultimate speedup with a 12-step deterministic sampler is 21.94x on H100 and 54.87x on B200.
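Below is a minimal sketch of a 12-step deterministic Euler sampler in the EDM style; the schedule constants and the `denoiser` interface are generic assumptions, and the production sampler used by CorrDiff lives in PhysicsNeMo and Earth2Studio.

```python
import torch

@torch.no_grad()
def euler_sampler(denoiser, latents, num_steps=12,
                  sigma_min=0.002, sigma_max=800.0, rho=7.0):
    """Deterministic first-order (Euler) sampler: no stochastic churn, no Heun correction."""
    steps = torch.arange(num_steps, device=latents.device)
    # EDM-style noise schedule from sigma_max down to sigma_min.
    t = (sigma_max ** (1 / rho)
         + steps / (num_steps - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    t = torch.cat([t, torch.zeros(1, device=latents.device)])  # t_N = 0

    x = latents * t[0]
    for i in range(num_steps):
        denoised = denoiser(x, t[i])       # D(x_i; sigma_i)
        d = (x - denoised) / t[i]          # first-order slope estimate
        x = x + (t[i + 1] - t[i]) * d      # single Euler step per denoising iteration
    return x
```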

Several of the optimizations described in this blog post also apply to diffusion models in general, and some are specific to patch-based approaches. As such, those can be ported to other models in PhysicsNeMo and used in the development of solutions beyond weather downscaling. 

Getting started

Train and run inference with CorrDiff in PhysicsNeMo: see the PhysicsNeMo CorrDiff documentation.

    To train with the optimized codebase, follow the instructions in the CorrDiff repo readme, and set the following options in the training.perf section of your selected training YAML config:

        fp_optimizations: amp-bf16
        use_apex_gn: True
        torch_compile: True
        profile_mode: False
    To run inference with the optimized codebase, follow the instructions in the CorrDiff repo readme, and set the following options in the generation.perf section of your selected generation config:

        use_fp16: True
        use_apex_gn: True
        use_torch_compile: True
        profile_mode: False
        io_syncronous: True
    Set profile_mode to False for optimized performance, as the NVTX annotations would introduce graph breaks into the torch.compile workflow.
    To use the latest Apex GroupNorm kernels, either build Apex GroupNorm in the PhysicsNeMo container Dockerfile or build it locally after loading the PhysicsNeMo container.
      Clone the Apex repo and build using:

        CFLAGS="-g0" NVCC_APPEND_FLAGS="--threads 8" \
        pip install \
          --no-build-isolation \
          --no-cache-dir \
          --disable-pip-version-check \
          --config-settings "--build-option=--group_norm" .

Learn more about optimized CorrDiff training in PhysicsNeMo and run optimized workflows in Earth2Studio.

Video 1. Visualizing patch-based CorrDiff downscaling on CONUS with 55x speedup
