Nvidia Developer, November 11, 03:47
AI-Powered Weather Forecasting: Significant Acceleration of the CorrDiff Model

 

This article describes how the NVIDIA Earth-2 platform accelerates AI weather and climate prediction models with a GPU-optimized software stack. In particular, CorrDiff, a generative AI downscaling model, sidesteps the computational bottlenecks of traditional numerical methods to deliver fine-grained, high-resolution weather predictions. Performance optimizations for CorrDiff training and inference, including AMP-BF16, multi-iteration patching, Apex GroupNorm, kernel fusion, and torch.compile, substantially improve computational efficiency. With these optimizations, planetary-scale training completes in under 3,000 GPU-hours, country-scale training drops to on the order of 100 GPU-hours, and planetary-scale inference finishes in GPU-minutes. The article also presents a Speed-of-Light analysis of post-optimization GPU utilization and provides guidance on obtaining and using the optimized code.

🚀 **AI-driven breakthrough in weather downscaling:** CorrDiff, a generative AI model, uses an innovative patch-based multidiffusion approach to overcome the prohibitive computational cost that traditional dynamical downscaling faces on large, high-resolution weather datasets. It generates fine-grained weather predictions with far greater efficiency and scalability, supporting critical sectors such as agriculture, energy, transportation, and disaster preparedness.

⚡ **Performance optimizations deliver generational speedups:** A series of advanced optimizations, including automatic mixed precision (AMP-BF16), multi-iteration patching to amortize regression cost, NVIDIA Apex GroupNorm to eliminate data transposes, and kernel fusion via torch.compile, dramatically accelerates CorrDiff training and inference. On NVIDIA Blackwell GPUs, training runs 53.86x faster than the baseline, country-scale training drops to on the order of 100 GPU-hours, and planetary-scale inference completes in GPU-minutes.

💡 **Maximizing GPU efficiency and accessibility:** Through a Speed-of-Light analysis and kernel-utilization visualizations, the article shows that the optimized CorrDiff workflow uses GPU resources efficiently, reaching 63%-67% of the hardware performance ceiling. These optimizations sharply lower the computational barrier for AI weather models, making training and customization of kilometer-scale AI weather forecasting practical and accelerating the adoption of AI in meteorology.

🌍 **Multi-domain applications and broad adoption:** CorrDiff has been widely adopted worldwide, including by The Weather Company (TWC) to support agriculture, energy, and aviation; by G42 to improve smog and dust storm prediction in the Middle East; and by Tomorrow.io to enhance storm-scale predictions such as fire weather forecasts and wind gust forecasts that disrupt railway operations, demonstrating its versatility and practical value wherever fine-scale weather information is essential.

As AI weather and climate prediction models rapidly gain adoption, the NVIDIA Earth-2 platform provides libraries and tools for accelerating solutions using a GPU-optimized software stack. Downscaling, which is the task of refining coarse-resolution (25km scale) weather data, enables national meteorological service (NMS) agencies to deliver high-resolution predictions for agriculture, energy, transportation, and disaster preparedness at spatial resolutions fine enough for actionable decision-making and planning. 

Traditional dynamical downscaling is prohibitively expensive, especially for large ensembles at high resolution and over extensive spatial domains. CorrDiff, a generative AI downscaling model, sidesteps the computational bottlenecks of traditional numerical methods, achieves state-of-the-art results, and uses a patch-based multidiffusion approach to scale to continental and global domains. This AI-based solution unlocks significant gains in efficiency and scalability compared to traditional numerical methods, while greatly reducing computational costs.

CorrDiff has gained global adoption for various use cases, demonstrating its versatility and impact across domains where fine-scale weather information is essential:

    The Weather Company (TWC) for supporting the agriculture, energy, and aviation industries.
    G42 for improving smog and dust storm predictions in the Middle East.
    Tomorrow.io for enhancing a range of storm-scale predictions, including fire weather forecasts and wind gust forecasts that disrupt railway operations.

In this blog post, we show the performance optimizations and enhancements for CorrDiff training and inference that were incorporated into two tools in the Earth-2 stack, NVIDIA PhysicsNeMo and NVIDIA Earth2Studio. Achieving over 50x speedup on training and inference baselines, these optimizations enable:

    Scaling patch-based training of the entire planet in under 3,000 GPU-hours.
    Lowering most country-scale trainings to O(100) GPU-hours.
    Training over the contiguous United States (CONUS) in under 1,000 GPU-hours.
    Fine-tuning and bespoke training that democratizes km-scale AI weather.
    Country-scale inference in GPU-seconds, planetary-scale inference in GPU-minutes.
    Generating large ensembles affordably for high-resolution probabilistic forecasting.
    Interactive exploration of kilometer-scale data.

CorrDiff: Training and inference 

Figure 1: CorrDiff training and sampling workflow

Figure 1 illustrates the training and sampling workflow of CorrDiff for generative downscaling. During diffusion training, a pretrained regression model is used to generate the conditional mean, which serves as input for training the diffusion model. For background and details on CorrDiff, refer to the CorrDiff publication, the PhysicsNeMo docs, and the source code.
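To make the two-stage structure concrete, here is a minimal, self-contained sketch of the idea in PyTorch. The stand-in modules and the residual-based loss are illustrative assumptions, not the PhysicsNeMo implementation; the point is that the pretrained regression model supplies the conditional mean while the diffusion model is trained on the remaining fine-scale detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in modules for illustration only; the real CorrDiff networks are
# UNet-style models defined in PhysicsNeMo.
regression_model = nn.Conv2d(12, 4, kernel_size=3, padding=1)        # coarse inputs -> conditional mean
diffusion_denoiser = nn.Conv2d(4 + 12, 4, kernel_size=3, padding=1)  # denoises the residual

def diffusion_training_step(coarse_input, fine_target, sigma):
    """One conceptual step: regress the conditional mean, then diffuse the residual."""
    with torch.no_grad():
        cond_mean = regression_model(coarse_input)   # stage 1: pretrained, frozen
    residual = fine_target - cond_mean               # stage 2 learns the residual detail
    noised = residual + torch.randn_like(residual) * sigma
    denoised = diffusion_denoiser(torch.cat([noised, coarse_input], dim=1))
    return F.mse_loss(denoised, residual)

# Toy shapes: 12 coarse input channels, 4 output channels, one 448x448 patch.
loss = diffusion_training_step(torch.randn(1, 12, 448, 448),
                               torch.randn(1, 4, 448, 448),
                               sigma=0.5)
```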

Why optimize CorrDiff? 

Diffusion models are resource-intensive because they rely on iterative sampling, with each denoising step involving multiple neural network computations. This makes inference time-consuming and costly. Training is also more expensive because the denoiser has to be trained for the full range of noise levels. Optimizing their performance requires:

    Streamlining core operations (e.g., fusing kernels, using mixed precision, using NVIDIA CUDA graphs, etc.).
    Improving the sampling process by reducing the number of denoising steps and using optimal time integration schemes.

CorrDiff uses the EDM architecture, where several computationally expensive operations, such as group normalization, activation functions, and convolutions, can be optimized using advanced packages and kernel fusion.

CorrDiff also uses a two-stage pipeline (regression and correction), offering opportunities to amortize cost across multiple diffusion steps by caching regression outputs, minimizing redundant compute.

Accelerated CorrDiff

In the following figures, we summarize the various optimizations that result in over a 50x speedup on both training and inference costs over the CONUS domain. Figures 2 and 3 summarize the cumulative speedup factors achieved over the baseline with each successive optimization. Details of each optimization are provided in subsequent sections. 

Figure 3. Patch-based CorrDiff Inference Speedup per sample, for CONUS-scale ~2M pixels (1056×1792). Green bars plot the speedup relative to the fp32 baseline performance (left y-axis) and grey solid line plots the sampling runtime per batch element in log scale (right y-axis)

Optimized CorrDiff: How it’s achieved

The baseline performance of CorrDiff on NVIDIA H100 GPUs with FP32 precision, batch size = 1, patch size = 1 (in absolute time) was as follows:

    Regression forward: 1204 ms
        Domain: CONUS, 1056 × 1792 pixels
        Input channels: [“u500”, “v500”, “z500”, “t500”, “u850”, “v850”, “z850”, “t850”, “u10m”, “v10m”, “t2m”, “tcwv”] at 25 km resolution
        Output channels: [“refc”, “2t”, “10u”, “10v”] at 2 km resolution
    Diffusion forward: 155 ms
        Domain: spatial patch of size 448 × 448 pixels
        Input channels: [“u500”, “v500”, “z500”, “t500”, “u850”, “v850”, “z850”, “t850”, “u10m”, “v10m”, “t2m”, “tcwv”] at 25 km resolution
        Output channels: [“refc”, “2t”, “10u”, “10v”] at 2 km resolution
    Diffusion backward: 219 ms

While effective, this baseline was limited by expensive regression model forward passes and inefficient data transposes.

Figure 4. Training performance profile: NVIDIA Nsight Systems profile showing runtime distribution of baseline patch-based CorrDiff training stages, with regression forward dominating total iteration time

Key CorrDiff training optimizations 

To achieve substantial acceleration in CorrDiff training, culminating in a 53.86x speedup on NVIDIA B200 and 25.51x on H100, we introduced the series of performance optimizations outlined below.

Optimization 1: Enable AMP-BF16 for training
The original training recipe uses FP32 precision. Here, we enabled Automatic Mixed Precision (AMP) with BF16 for training to reduce memory usage and improve throughput without compromising numerical stability, leading to a 2.03x speedup over baseline.
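The snippet below is a minimal sketch of how AMP with BF16 is typically enabled in a PyTorch training loop; `model`, `optimizer`, and `loss_fn` are generic placeholders rather than the actual CorrDiff training objects.

```python
import torch

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass and loss in BF16. BF16 keeps FP32's exponent range,
    # so no gradient scaler is needed (unlike FP16 autocast).
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(inputs), targets)
    loss.backward()   # backward runs outside autocast; parameters stay in FP32
    optimizer.step()
    return loss.detach()
```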

Optimization 2: Amortizing regression cost using multi-iteration patching
In the original patch-based training workflow, each 448×448 patch sample for diffusion model training required inference of the regression model over the full 1056×1792 CONUS domain. This caused diffusion model training throughput to be bottlenecked by regression model inference.

We improved efficiency by caching regression outputs and reusing them across 16 diffusion patches per timestamp. This provided broader spatial coverage while spreading regression costs more effectively, yielding a 12.33× speedup over the baseline.
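The following sketch illustrates the amortization pattern, assuming hypothetical `regression_model` and `diffusion_step` callables; the actual patching logic lives in the PhysicsNeMo CorrDiff trainer.

```python
import torch

def train_patches_for_timestamp(regression_model, diffusion_step,
                                coarse_input, fine_target,
                                patch_size=448, patches_per_timestamp=16):
    """Run the full-domain regression once, then reuse it for many diffusion patches."""
    with torch.no_grad():
        # One full-domain (e.g., 1056x1792) regression forward pass per timestamp...
        cond_mean = regression_model(coarse_input)

    _, _, height, width = fine_target.shape
    for _ in range(patches_per_timestamp):
        # ...amortized across several randomly placed 448x448 diffusion patches.
        top = torch.randint(0, height - patch_size + 1, (1,)).item()
        left = torch.randint(0, width - patch_size + 1, (1,)).item()
        window = (slice(None), slice(None),
                  slice(top, top + patch_size), slice(left, left + patch_size))
        diffusion_step(cond_mean[window], fine_target[window])
```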

Optimization 3: Eliminating data transposes with Apex GroupNorm
The training pipeline initially used the default NCHW memory layout, which triggers costly implicit memory transposes before/after convolutions. Switching the model and input tensors to NHWC (channels-last) format aligns them with cuDNN’s preferred layout. However, PyTorch GroupNorm ops do not support the channels-last format. To prevent transposes and keep data in channels-last format for more efficient normalization kernels, we replaced PyTorch GroupNorm with NVIDIA Apex GroupNorm. This eliminated transpose overhead and yielded a 16.71× speedup over the baseline.
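A minimal sketch of the layout change, assuming Apex is installed with its `apex.contrib.group_norm` extension (the module PhysicsNeMo uses); the block below is illustrative rather than the actual CorrDiff UNet.

```python
import torch
import torch.nn as nn
from apex.contrib.group_norm import GroupNorm as ApexGroupNorm  # channels-last GroupNorm

# Illustrative block only; in CorrDiff these layers live inside the EDM UNet.
block = nn.Sequential(
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    ApexGroupNorm(num_groups=32, num_channels=128),  # drop-in replacement for nn.GroupNorm
    nn.SiLU(),
).cuda().to(memory_format=torch.channels_last)       # keep weights in NHWC layout

x = torch.randn(1, 128, 448, 448, device="cuda").to(memory_format=torch.channels_last)
y = block(x)  # convolutions and normalization stay channels-last, avoiding transposes
```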

Optimization 4: Fusing GroupNorm with SiLU
By fusing GroupNorm and SiLU activation into a single kernel using Apex, we reduced kernel launches and the number of global memory accesses. This increased GPU utilization and delivered a 17.15× speedup over the baseline.
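If, as in PhysicsNeMo, the Apex layer exposes an activation argument, the fusion amounts to moving the SiLU into the normalization layer instead of keeping it as a separate module; the `act="silu"` keyword below is an assumption based on that usage.

```python
from apex.contrib.group_norm import GroupNorm as ApexGroupNorm

# Fused GroupNorm + SiLU: the activation runs inside the same kernel as the
# normalization, removing one kernel launch and one round trip to global memory.
norm_act = ApexGroupNorm(num_groups=32, num_channels=128, act="silu")
```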

Optimization 5: Extended channel dimension support in Apex GroupNorm
Some CorrDiff layers use channel dimensions unsupported by Apex. We extended support for these channel dimensions, unlocking fusion for all layers. This improved performance to 19.74× speedup over the baseline.

Optimization 6: Kernel fusion through graph compilation
We used torch.compile to fuse the remaining elementwise operations (e.g., addition, multiplication, sqrt, exp). This improved scheduling, reduced global memory accesses, and cut Python overhead, delivering a 25.51× speedup over the baseline.
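A toy example of the pattern: wrapping the denoiser in `torch.compile` lets TorchInductor fuse the surrounding elementwise math into fewer kernels. The module below is a stand-in, not the EDM denoiser.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in denoiser with EDM-like elementwise preconditioning around a conv."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(4, 4, kernel_size=3, padding=1)

    def forward(self, x, sigma):
        c_in = 1.0 / torch.sqrt(sigma**2 + 1.0)    # elementwise ops Inductor can fuse
        c_out = sigma / torch.sqrt(sigma**2 + 1.0)
        return x + c_out * self.conv(x * c_in)

denoiser = torch.compile(ToyDenoiser().cuda())     # fuse elementwise ops, cut Python overhead
out = denoiser(torch.randn(1, 4, 448, 448, device="cuda"),
               torch.tensor(0.5, device="cuda"))
```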

Optimization 7: Apex GroupNorm V2 on NVIDIA Blackwell
Using Apex GroupNorm V2, optimized for NVIDIA Blackwell GPUs, yielded a 53.86× speedup over the baseline on B200 and a 2.1× speedup over the H100-optimized workflow.

Figure 5: Optimized patch-based CorrDiff training profile

Training throughput 

We compare the training throughput of baseline CorrDiff on NVIDIA Hopper against optimized versions on Hopper and Blackwell in Table 1. The optimized implementation achieves improvements in efficiency across both architectures, with Blackwell showing the most significant gains.

Note: Regression refers to the regression forward pass. Diffusion refers to the diffusion forward pass. Total includes the combined cost of (regression forward + diffusion forward + diffusion backward).

| GPU Version | Precision | Regression (ms/patch) | Diffusion (ms/patch) | Total runtime (ms/patch) | Throughput (patch/s) |
|---|---|---|---|---|---|
| H100 Baseline | FP32 | 1204.0 | 374.0 | 1578.0 | 0.63 |
| H100 Optimized | BF16 | 10.609 | 51.25 | 61.859 | 16.2 |
| B200 Optimized | BF16 | 4.734 | 24.56 | 29.297 | 34.1 |
Table 1. CorrDiff training throughput comparison

Speed-of-Light analysis 

To evaluate how close our optimized CorrDiff workflow comes to the hardware performance ceiling, we conducted a Speed-of-Light (SOL) analysis on H100 and B200 GPUs. This provides an upper-bound estimate of achievable performance by assessing how effectively GPU resources are being used.

Steps to estimate SOL:

    1. Identify low-utilization kernels: We focus on kernels with both DRAM read/write bandwidth < 60% and Tensor Core utilization < 60%. Such kernels are neither memory-bound nor compute-bound, making them likely performance bottlenecks.
    2. Estimate per-kernel potential: For each low-utilization kernel, we estimate the potential speedup under ideal conditions, namely full DRAM bandwidth usage or full Tensor Core activity.
    3. Aggregate overall speedup: We then compute the hypothetical end-to-end speedup if each kernel were optimized to its ideal performance.
    4. Compute SOL efficiency: Finally, we estimate the fraction of theoretical maximum SOL as the fraction of peak performance achievable if the top 10 runtime-dominant kernels were individually boosted to their theoretical maximum. A minimal numeric sketch of this aggregation follows below.
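The arithmetic behind steps 2-4 can be sketched as follows; the kernel statistics are hypothetical numbers, not measurements from the CorrDiff profiles.

```python
# Hypothetical per-kernel stats: (share of total runtime, DRAM bandwidth util, Tensor Core util).
kernels = [
    (0.30, 0.85, 0.10),  # memory-bound: already near the DRAM ceiling
    (0.25, 0.10, 0.80),  # compute-bound: already near the Tensor Core ceiling
    (0.45, 0.30, 0.20),  # low utilization on both axes: optimization candidate
]

ideal_total = 0.0
for share, dram_util, tc_util in kernels:
    if dram_util < 0.60 and tc_util < 0.60:
        # Assume the kernel could be driven to full utilization of its better resource,
        # shrinking its runtime share by that utilization factor.
        ideal_total += share * max(dram_util, tc_util)
    else:
        ideal_total += share  # already memory- or compute-bound; leave unchanged

hypothetical_speedup = 1.0 / ideal_total
sol_fraction = ideal_total  # current performance as a fraction of the estimated SOL
print(f"hypothetical speedup: {hypothetical_speedup:.2f}x, SOL fraction: {sol_fraction:.0%}")
```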

Using this framework, our optimized CorrDiff workflow achieves 63% of the estimated SOL on H100 and 67% on B200. This indicates strong GPU utilization while still leaving headroom for future kernel-level improvements.

To further assess efficiency, we visualize kernel performance as shown in Figures 6 and 7. Each dot represents a kernel, plotted by NVIDIA Tensor Core utilization (x-axis) and combined DRAM read/write bandwidth utilization (y-axis). The dot size reflects its share of total runtime, highlighting performance-critical operations. 

Kernels near the top or right edges are generally well-optimized, as they fully exploit compute or memory resources. In contrast, kernels in the bottom-left quadrant underutilize both and represent the best opportunities for further optimization. This visualization provides a clear picture of the runtime distribution and helps identify where GPU efficiency can be improved.
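A quick way to reproduce this kind of view from profiler output is a scatter plot like the sketch below; the per-kernel numbers here are placeholders, not the profiled CorrDiff kernels.

```python
import matplotlib.pyplot as plt

# Placeholder per-kernel stats: (Tensor Core util, DRAM read+write util, share of runtime).
kernels = [
    (0.05, 0.10, 0.30),
    (0.75, 0.15, 0.20),
    (0.10, 0.85, 0.25),
    (0.20, 0.25, 0.25),
]

tc_util = [k[0] for k in kernels]
dram_util = [k[1] for k in kernels]
sizes = [5000 * k[2] for k in kernels]   # dot area proportional to runtime share

plt.scatter(tc_util, dram_util, s=sizes, alpha=0.6)
plt.xlabel("Tensor Core utilization")
plt.ylabel("DRAM read+write bandwidth utilization")
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.title("Per-kernel utilization (dot size = share of runtime)")
plt.show()
```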

Figure 6: Baseline FP32 patch-based CorrDiff kernel utilization on B200

Figure 6 shows the distribution of kernels in terms of Tensor Core utilization and DRAM bandwidth utilization for the baseline CorrDiff implementation. In an unoptimized workflow with FP32 precision, >95% of time is spent in low-utilization kernels where both DRAM utilization (read + write) and tensor core utilization are below 60%.

The majority of runtime-dominant kernels cluster near the origin, showing very low DRAM and Tensor Core utilization. Only a small number of kernels lie near the upper or right boundaries, where kernels become clearly memory-bound or compute-bound. The unoptimized US CorrDiff workflow reached only 1.23% of the estimated SOL on B200.

Figure 7: Optimized BF16 patch-based CorrDiff kernel utilization on B200

Figure 7 shows the distribution of kernels in the optimized implementation in terms of Tensor Core utilization and DRAM bandwidth utilization. In the optimized workflow with AMP-BF16 training, a higher proportion of kernels sit near the top-left or bottom-right edges, indicating good performance and GPU utilization. Optimized CorrDiff now reaches 67% of the estimated SOL on B200. Despite the overall improvements in the optimized workflow, some kernels still have the potential to be accelerated further.

CorrDiff inference optimizations 

Many of the training optimizations also apply to inference. In addition, we introduced several inference-specific optimizations to maximize performance.

Optimized multi-diffusion
CorrDiff uses a patch-based multi-diffusion approach, where overlapping spatial patches are denoised and then aggregated. Initially, 27.1% of the total runtime was spent in im2col folding/unfolding operations. Precomputing overlap counts for each pixel and using torch.compile() to accelerate the remaining folding/unfolding steps eliminates the im2col bottleneck entirely, resulting in a speedup of 7.86x.
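The overlap-count idea can be sketched with PyTorch's fold/unfold primitives, as below; this is a simplified illustration under assumed patch and stride sizes, not the Earth2Studio implementation.

```python
import torch
import torch.nn.functional as F

def make_overlap_counts(domain_hw, patch=448, stride=384, device="cuda"):
    """Precompute, once per domain, how many patches cover each output pixel."""
    H, W = domain_hw
    ones = torch.ones(1, 1, H, W, device=device)
    cols = F.unfold(ones, kernel_size=patch, stride=stride)
    counts = F.fold(cols, output_size=(H, W), kernel_size=patch, stride=stride)
    # Clamp avoids divide-by-zero at border pixels not covered by a patch in this toy setup.
    return counts.clamp_(min=1)

@torch.compile  # fuse the fold/normalize step instead of paying im2col overhead per sample
def aggregate_patches(patch_outputs, counts, domain_hw, patch=448, stride=384):
    """Fold overlapping denoised patches back onto the full domain and normalize."""
    summed = F.fold(patch_outputs, output_size=domain_hw, kernel_size=patch, stride=stride)
    return summed / counts
```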

Deterministic Euler sampler (12 steps) 
The original stochastic sampler used 18 denoising steps with the Heun solver and second-order correction. By enabling a deterministic sampler using the Euler solver (with no second-order correction), we reduced the number of denoising steps to 12 without impacting output quality. This change delivered an additional ~2.8× speedup on both Hopper and Blackwell. The ultimate speedup with a 12-step deterministic sampler is 21.94x on H100 and 54.87x on B200.
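Below is a minimal sketch of a 12-step deterministic Euler sampler in the EDM style; the schedule constants and the `denoiser` interface are generic assumptions, and the production sampler used by CorrDiff lives in PhysicsNeMo and Earth2Studio.

```python
import torch

@torch.no_grad()
def euler_sampler(denoiser, latents, num_steps=12,
                  sigma_min=0.002, sigma_max=800.0, rho=7.0):
    """Deterministic first-order (Euler) sampler: no stochastic churn, no Heun correction."""
    steps = torch.arange(num_steps, device=latents.device)
    # EDM-style noise schedule from sigma_max down to sigma_min.
    t = (sigma_max ** (1 / rho)
         + steps / (num_steps - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    t = torch.cat([t, torch.zeros(1, device=latents.device)])  # t_N = 0

    x = latents * t[0]
    for i in range(num_steps):
        denoised = denoiser(x, t[i])       # D(x_i; sigma_i)
        d = (x - denoised) / t[i]          # first-order slope estimate
        x = x + (t[i + 1] - t[i]) * d      # single Euler step per denoising iteration
    return x
```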

Several of the optimizations described in this blog post also apply to diffusion models in general, and some are specific to patch-based approaches. As such, those can be ported to other models in PhysicsNeMo and used in the development of solutions beyond weather downscaling. 

Getting started

Train and run inference with CorrDiff in PhysicsNeMo: see the PhysicsNeMo CorrDiff documentation.

    To train with the optimized codebase, follow the instructions in the CorrDiff repo readme, and set the following options in the training.perf section of your selected training YAML config:

        fp_optimizations: amp-bf16
        use_apex_gn: True
        torch_compile: True
        profile_mode: False
    To run inference with the optimized codebase, follow the instructions in the CorrDiff repo readme, and set the following options in the generation.perf section of your selected generation config:

        use_fp16: True
        use_apex_gn: True
        use_torch_compile: True
        profile_mode: False
        io_syncronous: True
    Set profile_mode to False for optimized performance, as the NVTX annotations would introduce graph breaks into the torch.compile workflow.
    To use the latest Apex GroupNorm kernels, either build Apex GroupNorm in the PhysicsNeMo container Dockerfile or build it locally after loading the PhysicsNeMo container.
      Clone the Apex repo and build using:

        CFLAGS="-g0" NVCC_APPEND_FLAGS="--threads 8" \
        pip install \
          --no-build-isolation \
          --no-cache-dir \
          --disable-pip-version-check \
          --config-settings "--build-option=--group_norm" .

Learn more about optimized CorrDiff training in PhysicsNeMo and run optimized workflows in Earth2Studio.

Video 1. Visualizing patch-based CorrDiff downscaling on CONUS with 55x speedup
