Nvidia Developer, September 3
NVIDIA HPC SDK 25.7 Simplifies GPU Programming and Accelerates HPC Applications

The NVIDIA HPC SDK 25.7 release brings a major update for high-performance computing (HPC) developers, centered on unified memory programming. The toolset automates data movement between CPU and GPU, significantly reducing the need for manual data management. This simplifies GPU development, shortens GPU porting cycles, lowers bug counts, and gives developers more flexibility when optimizing scientific workloads. The release is particularly relevant on architectures such as the NVIDIA Grace Hopper Superchip, where a unified address space frees developers from managing data transfers by hand, greatly improving development efficiency; its advantages have been validated on real-world cases such as the NEMO ocean model.

🚀 **Unified memory programming simplifies GPU development**: The core highlight of NVIDIA HPC SDK 25.7 is its comprehensive support for unified memory programming. By automating data movement between CPU and GPU, it frees developers from tedious manual data management and greatly simplifies the development and porting of GPU applications. This not only shortens development cycles but also reduces errors caused by mismanaged data, letting developers concentrate on algorithm optimization.

💡 **Higher developer productivity and more flexible code**: On architectures with tightly coupled CPUs and GPUs, such as the NVIDIA Grace Hopper Superchip, a unified address space means the CPU and GPU share the same view of memory. Developers no longer need to explicitly manage data copies between CPU and GPU; data movement is handled automatically by the CUDA driver. This simplification greatly increases developer productivity and allows workloads to be adjusted more flexibly for the best performance.

🌊 **NEMO ocean model as real-world validation**: Using the NEMO ocean model as an example, the article demonstrates the advantages of HPC SDK 25.7 in practice. By relying on unified memory programming, researchers at the Barcelona Supercomputing Center (BSC) were able to accelerate the port of the NEMO model to GPUs and experiment more freely with moving additional computation onto the GPU. Even in the early porting stages, simply annotating loops with OpenACC directives delivered significant speedups, confirming the great potential of unified memory programming for accelerating scientific computing.

⚙️ **Solving C++ data-management challenges on the GPU**: For GPU programming in C++, HPC SDK 25.7 uses unified memory to resolve the complexity of object encapsulation and data access. For example, when working with `std::vector`, there is no longer any need to fall back on raw pointers or rewrite code to expose the underlying storage; the container can be operated on directly on the GPU, avoiding the code rewrites or functional limitations that arise when data cannot be transferred correctly.

NVIDIA HPC SDK v25.7 delivers a significant leap forward for developers working on high-performance computing (HPC) applications with GPU acceleration. This release marks nearly two years of ongoing development focused on unified memory programming, resulting in a complete toolset that automates data movement between CPU and GPU. By eliminating much of the manual data management traditionally required, it streamlines GPU development, shortens GPU porting cycles, reduces bugs, and gives developers greater flexibility in optimizing scientific workloads.

Feature details

Coherent NVIDIA platforms with tightly coupled CPU and GPU architectures and a unified address space between them are gaining momentum in the HPC market. Notable examples include the NVIDIA GH200 Grace Hopper Superchip and NVIDIA GB200 NVL72 systems, now deployed in the ALPS supercomputer at the Swiss National Supercomputing Centre (CSCS) and in JUPITER at the Jülich Supercomputing Centre (JSC).

These architectures are attractive not just for their performance, enabled by the high-bandwidth NVLink-C2C interconnect between CPU and GPU, but also for their boost to developer productivity. With a shared address space, programmers no longer need to manually manage data transfers between CPU and GPU, as data movement is automatically handled by the NVIDIA CUDA driver. This simplification is already proving valuable in real-world projects.

“Taking advantage of unified memory programming really allows us to move faster with the porting of the NEMO ocean model to GPUs. It also gives us the flexibility to experiment with running more workloads on GPUs compared to the traditional approach.”— Alexey Medvedev, Senior Research Engineer, Barcelona Supercomputing Center (BSC)

Since the introduction of the NVIDIA Grace Hopper Superchip Architecture in late 2023, the NVIDIA HPC SDK has progressively added features tailored to unified memory programming. Some were announced in the blog post Simplifying GPU Programming for HPC with NVIDIA Grace Hopper Superchip.

Our NVIDIA HPC SDK 25.7 release introduces a complete toolset for simplifying—and in many cases, eliminating—manual data movement between CPUs and GPUs. This represents a major productivity boost for scientific application developers.

Data management is widely recognized as one of the most challenging aspects of GPU programming. Ensuring correct and efficient data flow between CPU and GPU subprograms often requires multiple iterations of debugging and optimization. In many HPC applications, this complexity is compounded by the use of dynamically allocated data and composite types. The following Fortran example illustrates a common pattern, often referred to as “deep copy”, where, to correctly parallelize a loop that accesses an allocatable array member of derived type, an additional loop must be introduced solely for managing data transfers using OpenACC directives.

type mytype
  integer, allocatable :: arrmem(:)
  integer :: scalarmem
end type mytype

! Declare a new array of derived type - mytype.
type(mytype) :: base(:)
…

! This loop is exclusively for copying data to the GPU.
!$acc enter data copyin(base(:))
do i = 1, N
  !$acc enter data copyin(base(i)%arrmem(:))
end do

! Parallel OpenACC loop accessing members of derived type mytype
! in each array element of base.
!$acc parallel loop collapse(2) default(present)
do i = 1, N
  do j = 1, M
    base(i)%arrmem(j) = base(i)%scalarmem
  end do
end do

In real-world HPC applications, loops often access multiple arrays—some may be embedded within derived types, others declared in modules or common blocks. Managing data movement in the majority of realistic codes can be highly complex and, in many instances, more difficult than identifying and annotating the parallel loops themselves. Understanding the data dependencies, ensuring correct transfers, and avoiding memory leaks or race conditions all add significant overhead to GPU programming.

Much of this complexity can be eliminated with the unified memory model available on Grace Hopper and similar GPU architectures. Since both the CPU and GPU share a single address space, explicit data management is often no longer necessary. The earlier example involving an array of derived types can now be parallelized directly without additional data movement directives, as shown.

type mytype
  integer, allocatable :: arrmem(:)
  integer :: scalarmem
end type mytype

! Declare a new array of derived type - mytype.
type(mytype) :: base(:)
…

! Parallel OpenACC loop accessing members of derived type mytype
! in each array element of base.
!$acc parallel loop collapse(2)
do i = 1, N
  do j = 1, M
    base(i)%arrmem(j) = base(i)%scalarmem
  end do
end do

The data management challenge intensifies with certain C++ codes offloaded to GPUs. The extensive use of object-oriented abstraction and data encapsulation often prevents developers from accessing implementation internals to copy data to the GPU. For instance, the OpenACC code with std::vector cannot execute correctly without unified memory. The subscript operator of std::vector, used within the loop, necessitates access to both the data within the std::vector class and the data allocated for its elements, which are not contiguously located with the std::vector class itself.

std::vector<int> v(N);
#pragma acc kernels
for (i = 0; i < v.size(); i++)
  v[i] = i;

Without unified memory, simply adding the copyout(v) clause to the kernel’s construct in the example is insufficient. Only the std::vector object itself is copied, but not the elements it contains. As a result, such code is often rewritten to operate directly on the raw pointer to the allocation of vector elements, reverting to a non-object-oriented programming style, and copying data from most other STL containers to the GPU is not possible. This is due to the lack of an interface that provides access to the elements of these standard C++ containers.

std::vector<int> v(N);
auto ptr = v.data();
#pragma acc kernels copyout(ptr[0:N])
for (i = 0; i < v.size(); i++)
  ptr[i] = i;

The Nucleus for European Modelling of the Ocean (NEMO) is an advanced modelling framework, used for research activities and forecasting services in ocean and climate sciences. 

Before BSC initiated porting NEMO to GPUs, we conducted an internal evaluation of the publicly available codebase to explore the benefits of simplified unified memory programming. 

In the NVIDIA GTC talks Accelerating Scientific Workflows With the NVIDIA Grace Hopper Platform and Accelerate Science With NVIDIA HPC Compilers, we used this real-world code as a case study to illustrate that developer productivity can be substantially boosted on coherent systems like Grace Hopper. 

Unified memory eliminates the need for explicit data management code, enabling us to focus solely on parallelization. With less code, developers see speedups at an earlier phase of the GPU porting process. For this demonstration, we used the GYRE_PISCES benchmark from NEMO v4.2.0 on an ORCA ½ grid. 

This is a memory bandwidth-bound benchmark, originally parallelized for multicore CPUs using MPI, with the main hotspots being the diffusion and advection of the active and passive tracers. We focused on these parts, simply parallelizing loops in the performance-critical regions using OpenACC and leaving the memory management to the CUDA driver and the hardware.

Our initial code porting strategy to the GPU was as follows, with no GPU data management code added:

- Fully parallel, tightly nested loops were parallelized using !$acc parallel loop gang vector collapse().
- Loops with cross-iteration dependencies were annotated using !$acc loop seq.
- Operations in array notation were wrapped inside !$acc kernels.
- External routines inside parallel loops were annotated using !$acc routine seq.
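To make these four patterns concrete, here is a small, self-contained Fortran sketch of each annotation. This is illustrative only: the module, routine, and variable names (port_patterns, update_field, twice, and so on) are hypothetical and not taken from NEMO.

module port_patterns
contains

  ! 4. External routine called inside parallel loops, compiled for the
  !    device with !$acc routine seq.
  real function twice(x)
    !$acc routine seq
    real, intent(in) :: x
    twice = 2.0 * x
  end function twice

  subroutine update_field(fld, rhs, col, coef, ni, nj, nk)
    integer, intent(in)    :: ni, nj, nk
    real,    intent(inout) :: fld(ni,nj), col(ni,nk)
    real,    intent(in)    :: rhs(ni,nj), coef(nk)
    integer :: ji, jj, jk

    ! 3. Operation in array notation wrapped inside a kernels region.
    !$acc kernels
    fld(:,:) = 0.0
    !$acc end kernels

    ! 1. Fully parallel, tightly nested loops: collapsed and mapped to gang/vector.
    !$acc parallel loop gang vector collapse(2)
    do jj = 1, nj
      do ji = 1, ni
        fld(ji,jj) = fld(ji,jj) + twice(rhs(ji,jj))
      end do
    end do

    ! 2. Loop with a cross-iteration (vertical) dependency kept sequential,
    !    while the independent inner loop is still parallelized.
    !$acc parallel
    !$acc loop seq
    do jk = nk-1, 1, -1
      !$acc loop gang vector
      do ji = 1, ni
        col(ji,jk) = col(ji,jk) - coef(jk) * col(ji,jk+1)
      end do
    end do
    !$acc end parallel
  end subroutine update_field

end module port_patterns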

Due to NEMO’s structure, where many functions contain multiple parallel regions often back-to-back, performance is impacted by implicit synchronizations at the end of each OpenACC parallel construct. Such synchronizations are added to simplify the programming and to preserve execution order between parallelized and sequential code, as in the original non-parallelized code. 

To avoid these implicit barriers, async clauses were added to the parallel and kernels constructs, enabling better concurrency. When running parallel regions in asynchronous mode, synchronization was introduced using !$acc wait either to ensure data computed on the GPU was available before subsequent MPI calls, or to prevent local variables from going out of scope before the end of the subroutines.

The following code fragment, taken from trazdf.f90 within the public open-source NEMO repository, demonstrates the OpenACC parallelization strategy described previously.

SUBROUTINE tra_zdf_imp(...)
  ...
  REAL(wp), DIMENSION(ntsi-(nn_hls):ntei+(nn_hls),ntsj-(nn_hls):ntej+(nn_hls),jpk) :: &
     &   zwi, zwt, zwd, zws
  ...
  DO jn = 1, kjpt
    ...
    !* 3d recurrence: Xk = ( Zk - Sk Xk+1 ) / Tk (result is the after tracer)
    ! Fully parallel collapsed tightly nested OpenACC loops
    !$acc parallel loop gang vector collapse(2) async(1)
    DO jj = ntsj, ntej
      DO ji = ntsi, ntei
        pt(ji,jj,jpkm1,jn,Kaa) = pt(ji,jj,jpkm1,jn,Kaa) / zwt(ji,jj,jpkm1) * tmask(ji,jj,jpkm1)
      END DO
    END DO
    !$acc end parallel
    ! As above OpenACC parallel loop is with async clause,
    ! no synchronization with the CPU here
    !$acc parallel async(1)
    ! Sequential OpenACC loop due to vertical dependencies
    !$acc loop seq
    DO jk = jpk-2, 1, -1
      ! Fully parallel collapsed tightly nested OpenACC loops
      !$acc loop gang vector collapse(2)
      DO jj = ntsj, ntej
        DO ji = ntsi, ntei
          pt(ji,jj,jk,jn,Kaa) = ( pt(ji,jj,jk,jn,Kaa) - zws(ji,jj,jk) * pt(ji,jj,jk+1,jn,Kaa) ) &
             &                  / zwt(ji,jj,jk) * tmask(ji,jj,jk)
        END DO
      END DO
    END DO
    !$acc end parallel
    ! As above OpenACC parallel region is with async clause,
    ! no synchronization with the CPU here.
    ...
  END DO
  !$acc wait
  ! As OpenACC wait enforces synchronization with the CPU,
  ! the CPU waits here for all work to be completed on the GPU.
END SUBROUTINE tra_zdf_imp

Further optimizing asynchronous execution led to data races in some parts of the code. This was particularly evident when GPU kernels asynchronously accessed shared data while the CPU was exiting functions where that shared data had been locally allocated. Fortunately, the OpenACC 3.4 specification, released in June 2025, introduced a capture modifier on existing data clauses that resolves these race conditions as detailed in Announcing OpenACC 3.4 at ISC 2025.
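As a rough sketch of what this looks like in practice (not taken from the article; the data-clause modifier syntax shown here is an assumption and should be checked against the OpenACC 3.4 specification and the HPC SDK documentation), the capture modifier asks the runtime to take its own copy of a locally allocated array when the asynchronous construct launches, so the host may free that array before the kernel completes.

! Hypothetical example; "capture" data-clause modifier syntax assumed from OpenACC 3.4.
subroutine step(out, n)
  integer, intent(in)  :: n
  real,    intent(out) :: out(n)
  real, allocatable    :: ztmp(:)   ! local work array, freed when the routine returns
  integer :: i

  allocate(ztmp(n))
  ztmp = 1.0

  ! The capture modifier (assumed syntax) snapshots ztmp at launch, so this
  ! asynchronous kernel no longer races with the deallocation of ztmp that
  ! happens when the subroutine exits.
  !$acc parallel loop async(1) copyin(capture: ztmp)
  do i = 1, n
    out(i) = 2.0 * ztmp(i)
  end do

  ! The caller synchronizes later with !$acc wait, as in the NEMO fragment above.
end subroutine step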

In HPC SDK 25.7, developers can safely manage data used asynchronously without refactoring large parts of their application. Moreover, the new features in the HPC SDK are designed not only to eliminate data races but also to tackle the often more challenging task of detecting them. 

In the GPU port of NEMO, all memory management was automatically handled by the CUDA driver. On coherent platforms like Grace Hopper, system-allocated memory is accessible from both CPU and GPU, and the placement of the allocation follows the first-touch policy. Since CUDA 12.4, access-counter heuristics enable automatic migration of CPU memory pages to GPU memory for frequently accessed pages. See CUDA Toolkit 12.4 Enhances Support for NVIDIA Grace Hopper and Confidential Computing.

These automatic migrations can improve locality and performance. Similar functionality is supported on x86-hosted GPU systems with a sufficiently recent Linux kernel, as detailed in Simplifying GPU Application Development with Heterogeneous Memory Management.

Thanks to high-bandwidth interconnects and dedicated hardware coherency mechanisms, Grace Hopper systems are expected to outperform unified memory systems that rely on Heterogeneous Memory Management (HMM). However, HMM-based systems continue to evolve, with ongoing kernel and driver improvements aimed at reducing latency and improving page migration efficiency across a broader range of GPU-accelerated platforms.

The following figure illustrates the incremental porting process of one timestep of the NEMO ocean model on a single Grace Hopper Superchip. Starting from a multicore CPU execution on Grace, we gradually move computations to the Hopper GPU, simply annotating loops with OpenACC as explained previously. 

We port the horizontal diffusion of the active and passive tracers (Step 1), the tracer advection (Step 2), and then the vertical diffusion and time-filtering (Step 3). For each ported portion of code, speedups of ~2x to ~5x were observed when executing on Hopper relative to execution on the Grace cores, and an overall end-to-end speedup of ~2x was obtained for the partially accelerated simulation.

Figure 1. Execution profile of a NEMO ocean model timestep on Grace Hopper, showing progressive GPU acceleration and speedup relative to multicore CPU execution for a code segment (straight arrows on the profile) and for the full simulation timestep (circular arrows on the right)

Even in the early stages of porting, we observed end-to-end speedups for the partially GPU-accelerated workload. Gradually offloading more components to the GPU further improves simulation performance. Unified memory simplifies memory management—frequently accessed CPU pages automatically migrate to GPU memory, enabling faster GPU kernel execution. CPU components maintain good performance, even when accessing GPU-resident data, thanks to the efficient remote access link. Since NEMO is an MPI-based code, we used multiple ranks on Grace to saturate the CPU memory and NVLink-C2C bandwidth, and enabled MPS on Hopper to reduce context switching overhead.

Overall, the performance gains achieved with relatively minimal effort highlight the promise of unified memory for accelerating scientific codes on the GPU. Seeing meaningful speedups very early in the GPU porting process, even with partially GPU-accelerated applications, is a notable shift in developer experience. That said, there remains significant room for further optimization, and we believe that with continued tuning and development, these speedups can be further improved.

To begin accelerating your applications with OpenACC and unified memory, download the NVIDIA HPC SDK today. For in-depth information on current capabilities, limitations, and upcoming updates, refer to the NVIDIA HPC SDK documentation.
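As a minimal starting point, a first experiment could look like the sketch below. This is not from the article; the build line is an assumption based on the HPC SDK compiler options for OpenACC and unified memory (here taken to be nvfortran -acc -gpu=mem:unified), so confirm the exact spelling against the documentation for your SDK version.

! Minimal sketch: no OpenACC data directives at all; on a coherent platform
! (or with unified memory enabled) the CUDA driver handles any migration.
! Assumed build line: nvfortran -acc -gpu=mem:unified saxpy_um.f90
program saxpy_um
  implicit none
  integer, parameter :: n = 1000000
  real, parameter    :: a = 2.0
  real, allocatable  :: x(:), y(:)
  integer :: i

  allocate(x(n), y(n))
  x = 1.0          ! first touched on the CPU
  y = 0.0

  ! Offloaded loop over plain system-allocated arrays, no copyin/copyout.
  !$acc parallel loop
  do i = 1, n
    y(i) = a * x(i) + y(i)
  end do

  print *, 'y(1) =', y(1)   ! read back on the CPU, again without data clauses
  deallocate(x, y)
end program saxpy_um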
