Nvidia Developer, September 3
NVIDIA HPC SDK 25.7 Simplifies GPU Programming and Accelerates HPC Applications

The NVIDIA HPC SDK 25.7 release brings a major update for high-performance computing (HPC) developers, centered on unified memory programming. The toolset automates data movement between CPU and GPU, significantly reducing the need for manual data management. This simplifies GPU development, shortens GPU porting cycles, lowers bug counts, and gives developers more flexibility when optimizing scientific workloads. The release is particularly relevant on architectures such as the NVIDIA Grace Hopper Superchip, where a unified address space frees developers from managing data transfers by hand, greatly improving development efficiency; its advantages have been validated on real-world cases such as the NEMO ocean model.

🚀 **Unified memory programming simplifies GPU development**: The core highlight of NVIDIA HPC SDK 25.7 is its comprehensive support for unified memory programming. By automating data movement between CPU and GPU, it frees developers from tedious manual data management and greatly simplifies the development and porting of GPU applications. This not only shortens development cycles but also reduces errors caused by mismanaged data, letting developers concentrate on algorithm optimization.

💡 **Higher developer productivity and more flexible code**: On architectures with tightly coupled CPUs and GPUs, such as the NVIDIA Grace Hopper Superchip, a unified address space means the CPU and GPU share the same view of memory. Developers no longer need to explicitly manage data copies between CPU and GPU; data movement is handled automatically by the CUDA driver. This simplification greatly increases developer productivity and allows workloads to be adjusted more flexibly for the best performance.

🌊 **NEMO ocean model as real-world validation**: Using the NEMO ocean model as an example, the article demonstrates the advantages of HPC SDK 25.7 in practice. By relying on unified memory programming, researchers at the Barcelona Supercomputing Center (BSC) were able to accelerate the port of the NEMO model to GPUs and experiment more freely with moving additional computation onto the GPU. Even in the early porting stages, simply annotating loops with OpenACC directives delivered significant speedups, confirming the great potential of unified memory programming for accelerating scientific computing.

⚙️ **Solving C++ data-management challenges on the GPU**: For GPU programming in C++, HPC SDK 25.7 uses unified memory to resolve the complexity of object encapsulation and data access. For example, when working with `std::vector`, there is no longer any need to fall back on raw pointers or rewrite code to expose the underlying storage; the container can be operated on directly on the GPU, avoiding the code rewrites or functional limitations that arise when data cannot be transferred correctly.

NVIDIA HPC SDK v25.7 delivers a significant leap forward for developers working on high-performance computing (HPC) applications with GPU acceleration. This release marks nearly two years of ongoing development focused on unified memory programming, resulting in a complete toolset that automates data movement between CPU and GPU. By eliminating much of the manual data management traditionally required, it streamlines GPU development, shortens GPU porting cycles, reduces bugs, and gives developers greater flexibility in optimizing scientific workloads.

Feature details

Coherent NVIDIA platforms with tightly coupled CPU and GPU architectures and a unified address space between them are gaining momentum in the HPC market. Notable examples include the NVIDIA GH200 Grace Hopper Superchip and NVIDIA GB200 NVL72 systems, now deployed in the ALPS supercomputer at the Swiss National Supercomputing Centre (CSCS) and in JUPITER at the Jülich Supercomputing Centre (JSC).

These architectures are attractive not just for their performance, enabled by the high-bandwidth NVLink-C2C interconnect between CPU and GPU, but also for their boost to developer productivity. With a shared address space, programmers no longer need to manually manage data transfers between CPU and GPU, as data movement is automatically handled by the NVIDIA CUDA driver. This simplification is already proving valuable in real-world projects.

“Taking advantage of unified memory programming really allows us to move faster with the porting of the NEMO ocean model to GPUs. It also gives us the flexibility to experiment with running more workloads on GPUs compared to the traditional approach.”— Alexey Medvedev, Senior Research Engineer, Barcelona Supercomputing Center (BSC)

Since the introduction of the NVIDIA Grace Hopper Superchip Architecture in late 2023, the NVIDIA HPC SDK has progressively added features tailored to unified memory programming. Some were announced in the blog post Simplifying GPU Programming for HPC with NVIDIA Grace Hopper Superchip.

Our NVIDIA HPC SDK 25.7 release introduces a complete toolset for simplifying—and in many cases, eliminating—manual data movement between CPUs and GPUs. This represents a major productivity boost for scientific application developers.

Data management is widely recognized as one of the most challenging aspects of GPU programming. Ensuring correct and efficient data flow between CPU and GPU subprograms often requires multiple iterations of debugging and optimization. In many HPC applications, this complexity is compounded by the use of dynamically allocated data and composite types. The following Fortran example illustrates a common pattern, often referred to as “deep copy”, where, to correctly parallelize a loop that accesses an allocatable array member of derived type, an additional loop must be introduced solely for managing data transfers using OpenACC directives.

type mytype
  integer, allocatable :: arrmem(:)
  integer :: scalarmem
end type mytype

! Declare a new array of derived type - mytype.
type(mytype) :: base(:)
…

! This loop is exclusively for copying data to the GPU.
!$acc enter data copyin(base(:))
do i = 1, N
  !$acc enter data copyin(base(i)%arrmem(:))
end do

! Parallel OpenACC loop accessing members of derived type mytype
! in each array element of base.
!$acc parallel loop collapse(2) default(present)
do i = 1, N
  do j = 1, M
    base(i)%arrmem(j) = base(i)%scalarmem
  end do
end do

In real-world HPC applications, loops often access multiple arrays—some may be embedded within derived types, others declared in modules or common blocks. Managing data movement in the majority of realistic codes can be highly complex and, in many instances, more difficult than identifying and annotating the parallel loops themselves. Understanding the data dependencies, ensuring correct transfers, and avoiding memory leaks or race conditions all add significant overhead to GPU programming.

Much of this complexity can be eliminated with the unified memory model available on Grace Hopper and similar GPU architectures. Since both the CPU and GPU share a single address space, explicit data management is often no longer necessary. The earlier example involving an array of derived types can now be parallelized directly without additional data movement directives, as shown.

type mytype
  integer, allocatable :: arrmem(:)
  integer :: scalarmem
end type mytype

! Declare a new array of derived type - mytype.
type(mytype) :: base(:)
…

! Parallel OpenACC loop accessing members of derived type mytype
! in each array element of base.
!$acc parallel loop collapse(2)
do i = 1, N
  do j = 1, M
    base(i)%arrmem(j) = base(i)%scalarmem
  end do
end do

The data management challenge intensifies with certain C++ codes offloaded to GPUs. The extensive use of object-oriented abstraction and data encapsulation often prevents developers from accessing implementation internals to copy data to the GPU. For instance, the OpenACC code with std::vector cannot execute correctly without unified memory. The subscript operator of std::vector, used within the loop, necessitates access to both the data within the std::vector class and the data allocated for its elements, which are not contiguously located with the std::vector class itself.

std::vector<int> v(N);
#pragma acc kernels
for (i = 0; i < v.size(); i++)
  v[i] = i;

Without unified memory, simply adding the copyout(v) clause to the kernel’s construct in the example is insufficient. Only the std::vector object itself is copied, but not the elements it contains. As a result, such code is often rewritten to operate directly on the raw pointer to the allocation of vector elements, reverting to a non-object-oriented programming style, and copying data from most other STL containers to the GPU is not possible. This is due to the lack of an interface that provides access to the elements of these standard C++ containers.

std::vector<int> v(N);
auto ptr = v.data();
#pragma acc kernels copyout(ptr[0:N])
for (i = 0; i < v.size(); i++)
  ptr[i] = i;

The Nucleus for European Modelling of the Ocean (NEMO) is an advanced modelling framework, used for research activities and forecasting services in ocean and climate sciences. 

Before BSC initiated porting NEMO to GPUs, we conducted an internal evaluation of the publicly available codebase to explore the benefits of simplified unified memory programming. 

In the NVIDIA GTC talks Accelerating Scientific Workflows With the NVIDIA Grace Hopper Platform and Accelerate Science With NVIDIA HPC Compilers, we used this real-world code as a case study to illustrate that developer productivity can be substantially boosted on coherent systems like Grace Hopper. 

Unified memory eliminates the need for explicit data management code, enabling us to focus solely on parallelization. With less code, developers see speedups at an earlier phase of the GPU porting process. For this demonstration, we used the GYRE_PISCES benchmark from NEMO v4.2.0 on an ORCA ½ grid. 

This is a memory bandwidth-bound benchmark, originally parallelized for multicore CPUs using MPI, with the main hotspots being the diffusion and advection of the active and passive tracers. We focused on these parts, simply parallelizing loops in the performance-critical regions using OpenACC and leaving the memory management to the CUDA driver and the hardware.

Our initial code porting strategy to the GPU was as follows, with no GPU data management code added:

- Fully parallel, tightly nested loops were parallelized using !$acc parallel loop gang vector collapse().
- Loops with cross-iteration dependencies were annotated using !$acc loop seq.
- Operations in array notation were wrapped inside !$acc kernels.
- External routines inside parallel loops were annotated using !$acc routine seq.
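To make these four patterns concrete, here is a small, self-contained Fortran sketch of each annotation. This is illustrative only: the module, routine, and variable names (port_patterns, update_field, twice, and so on) are hypothetical and not taken from NEMO.

module port_patterns
contains

  ! 4. External routine called inside parallel loops, compiled for the
  !    device with !$acc routine seq.
  real function twice(x)
    !$acc routine seq
    real, intent(in) :: x
    twice = 2.0 * x
  end function twice

  subroutine update_field(fld, rhs, col, coef, ni, nj, nk)
    integer, intent(in)    :: ni, nj, nk
    real,    intent(inout) :: fld(ni,nj), col(ni,nk)
    real,    intent(in)    :: rhs(ni,nj), coef(nk)
    integer :: ji, jj, jk

    ! 3. Operation in array notation wrapped inside a kernels region.
    !$acc kernels
    fld(:,:) = 0.0
    !$acc end kernels

    ! 1. Fully parallel, tightly nested loops: collapsed and mapped to gang/vector.
    !$acc parallel loop gang vector collapse(2)
    do jj = 1, nj
      do ji = 1, ni
        fld(ji,jj) = fld(ji,jj) + twice(rhs(ji,jj))
      end do
    end do

    ! 2. Loop with a cross-iteration (vertical) dependency kept sequential,
    !    while the independent inner loop is still parallelized.
    !$acc parallel
    !$acc loop seq
    do jk = nk-1, 1, -1
      !$acc loop gang vector
      do ji = 1, ni
        col(ji,jk) = col(ji,jk) - coef(jk) * col(ji,jk+1)
      end do
    end do
    !$acc end parallel
  end subroutine update_field

end module port_patterns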

Due to NEMO’s structure, where many functions contain multiple parallel regions often back-to-back, performance is impacted by implicit synchronizations at the end of each OpenACC parallel construct. Such synchronizations are added to simplify the programming and to preserve execution order between parallelized and sequential code, as in the original non-parallelized code. 

To avoid these implicit barriers, async clauses were added to the parallel and kernels constructs, enabling better concurrency. When running parallel regions in asynchronous mode, synchronization was introduced using !$acc wait either to ensure data computed on the GPU was available before subsequent MPI calls, or to prevent local variables from going out of scope before the end of the subroutines.

The following code fragment, taken from trazdf.f90 within the public open-source NEMO repository, demonstrates the OpenACC parallelization strategy described previously.

SUBROUTINE tra_zdf_imp(...)
  ...
  REAL(wp), DIMENSION(ntsi-(nn_hls):ntei+(nn_hls),ntsj-(nn_hls):ntej+(nn_hls),jpk) :: &
     &   zwi, zwt, zwd, zws
  ...
  DO jn = 1, kjpt
    ...
    !* 3d recurrence: Xk = ( Zk - Sk Xk+1 ) / Tk (result is the after tracer)
    ! Fully parallel collapsed tightly nested OpenACC loops
    !$acc parallel loop gang vector collapse(2) async(1)
    DO jj = ntsj, ntej
      DO ji = ntsi, ntei
        pt(ji,jj,jpkm1,jn,Kaa) = pt(ji,jj,jpkm1,jn,Kaa) / zwt(ji,jj,jpkm1) * tmask(ji,jj,jpkm1)
      END DO
    END DO
    !$acc end parallel
    ! As above OpenACC parallel loop is with async clause,
    ! no synchronization with the CPU here
    !$acc parallel async(1)
    ! Sequential OpenACC loop due to vertical dependencies
    !$acc loop seq
    DO jk = jpk-2, 1, -1
      ! Fully parallel collapsed tightly nested OpenACC loops
      !$acc loop gang vector collapse(2)
      DO jj = ntsj, ntej
        DO ji = ntsi, ntei
          pt(ji,jj,jk,jn,Kaa) = ( pt(ji,jj,jk,jn,Kaa) - zws(ji,jj,jk) * pt(ji,jj,jk+1,jn,Kaa) ) &
             &                  / zwt(ji,jj,jk) * tmask(ji,jj,jk)
        END DO
      END DO
    END DO
    !$acc end parallel
    ! As above OpenACC parallel region is with async clause,
    ! no synchronization with the CPU here.
    ...
  END DO
  !$acc wait
  ! As OpenACC wait enforces synchronization with the CPU,
  ! the CPU waits here for all work to be completed on the GPU.
END SUBROUTINE tra_zdf_imp

Further optimizing asynchronous execution led to data races in some parts of the code. This was particularly evident when GPU kernels asynchronously accessed shared data while the CPU was exiting functions where that shared data had been locally allocated. Fortunately, the OpenACC 3.4 specification, released in June 2025, introduced a capture modifier on existing data clauses that resolves these race conditions as detailed in Announcing OpenACC 3.4 at ISC 2025.
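As a rough sketch of what this looks like in practice (not taken from the article; the data-clause modifier syntax shown here is an assumption and should be checked against the OpenACC 3.4 specification and the HPC SDK documentation), the capture modifier asks the runtime to take its own copy of a locally allocated array when the asynchronous construct launches, so the host may free that array before the kernel completes.

! Hypothetical example; "capture" data-clause modifier syntax assumed from OpenACC 3.4.
subroutine step(out, n)
  integer, intent(in)  :: n
  real,    intent(out) :: out(n)
  real, allocatable    :: ztmp(:)   ! local work array, freed when the routine returns
  integer :: i

  allocate(ztmp(n))
  ztmp = 1.0

  ! The capture modifier (assumed syntax) snapshots ztmp at launch, so this
  ! asynchronous kernel no longer races with the deallocation of ztmp that
  ! happens when the subroutine exits.
  !$acc parallel loop async(1) copyin(capture: ztmp)
  do i = 1, n
    out(i) = 2.0 * ztmp(i)
  end do

  ! The caller synchronizes later with !$acc wait, as in the NEMO fragment above.
end subroutine step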

In HPC SDK 25.7, developers can safely manage data used asynchronously without refactoring large parts of their application. Moreover, the new features in the HPC SDK are designed not only to eliminate data races but also to tackle the often more challenging task of detecting them. 

In the GPU port of NEMO, all memory management was automatically handled by the CUDA driver. On coherent platforms like Grace Hopper, system-allocated memory is accessible from both CPU and GPU, and the placement of the allocation follows the first-touch policy. Since CUDA 12.4, access-counter heuristics enable automatic migration of CPU memory pages to GPU memory for frequently accessed pages. See CUDA Toolkit 12.4 Enhances Support for NVIDIA Grace Hopper and Confidential Computing.

These automatic migrations can improve locality and performance. Similar functionality is supported on x86-hosted GPU systems with a sufficiently recent Linux kernel, as detailed in Simplifying GPU Application Development with Heterogeneous Memory Management.

Thanks to high-bandwidth interconnects and dedicated hardware coherency mechanisms, Grace Hopper systems are expected to outperform unified memory systems that rely on Heterogeneous Memory Management (HMM). However, HMM-based systems continue to evolve, with ongoing kernel and driver improvements aimed at reducing latency and improving page migration efficiency across a broader range of GPU-accelerated platforms.

The following figure illustrates the incremental porting process of one timestep of the NEMO ocean model on a single Grace Hopper Superchip. Starting from a multicore CPU execution on Grace, we gradually move computations to the Hopper GPU, simply annotating loops with OpenACC as explained previously. 

We port the horizontal diffusion of the active and passive tracers (Step 1), the tracer advection (Step 2), and then the vertical diffusion and time-filtering (Step 3). For each ported portion of code, speedups of ~2x to ~5x were observed when executing on Hopper relative to execution on the Grace cores, and an overall end-to-end speedup of ~2x was obtained for the partially accelerated simulation.

Figure 1. Execution profile of a NEMO ocean model timestep on Grace Hopper, showing progressive GPU acceleration and speedup relative to multicore CPU execution for a code segment (straight arrows on the profile) and for the full simulation timestep (circular arrows on the right)

Even in the early stages of porting, we observed end-to-end speedups for the partially GPU-accelerated workload. Gradually offloading more components to the GPU further improves simulation performance. Unified memory simplifies memory management—frequently accessed CPU pages automatically migrate to GPU memory, enabling faster GPU kernel execution. CPU components maintain good performance, even when accessing GPU-resident data, thanks to the efficient remote access link. Since NEMO is an MPI-based code, we used multiple ranks on Grace to saturate the CPU memory and NVLink-C2C bandwidth, and enabled MPS on Hopper to reduce context switching overhead.

Overall, the performance gains achieved with relatively minimal effort highlight the promise of unified memory for accelerating scientific codes on the GPU. Seeing meaningful speedups very early in the GPU porting process, even with partially GPU-accelerated applications, is a notable shift in developer experience. That said, there remains significant room for further optimization, and we believe that with continued tuning and development, these speedups can be further improved.

To begin accelerating your applications with OpenACC and unified memory, download the NVIDIA HPC SDK today. For in-depth information on current capabilities, limitations, and upcoming updates, refer to the NVIDIA HPC SDK documentation.
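As a minimal starting point, a first experiment could look like the sketch below. This is not from the article; the build line is an assumption based on the HPC SDK compiler options for OpenACC and unified memory (here taken to be nvfortran -acc -gpu=mem:unified), so confirm the exact spelling against the documentation for your SDK version.

! Minimal sketch: no OpenACC data directives at all; on a coherent platform
! (or with unified memory enabled) the CUDA driver handles any migration.
! Assumed build line: nvfortran -acc -gpu=mem:unified saxpy_um.f90
program saxpy_um
  implicit none
  integer, parameter :: n = 1000000
  real, parameter    :: a = 2.0
  real, allocatable  :: x(:), y(:)
  integer :: i

  allocate(x(n), y(n))
  x = 1.0          ! first touched on the CPU
  y = 0.0

  ! Offloaded loop over plain system-allocated arrays, no copyin/copyout.
  !$acc parallel loop
  do i = 1, n
    y(i) = a * x(i) + y(i)
  end do

  print *, 'y(1) =', y(1)   ! read back on the CPU, again without data clauses
  deallocate(x, y)
end program saxpy_um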
