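Introduction to CUDA Python

I. Introduction to CUDA Python

CUDA Python is NVIDIA's toolset for accessing the CUDA platform from Python. It consists of several subcomponents that range from low-level API bindings to high-level Pythonic interfaces, and it targets GPU-accelerated computing and the implementation of parallel algorithms. The core components are:

cuda.bindings: low-level bindings to the CUDA C APIs

cuda.core: high-level Pythonic interface

cuda.pathfinder: locator for CUDA components installed on the system

cuda.cccl.*: Python access to the CCCL parallel algorithms

numba.cuda: Numba's CUDA target, for writing kernels in a subset of Python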

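II. Installation

Install the main components with pip:

    # Install the metapackage (pulls in several subcomponents)
    pip install cuda-python

    # Or install subcomponents individually, as needed
    pip install "cuda-core[cu12]"   # high-level Pythonic interface
    pip install cuda-bindings       # low-level API bindings
    pip install cuda-pathfinder     # component locator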

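III. Common APIs and Examples for the Core Components

1. `cuda.bindings`: low-level CUDA API bindings

`cuda.bindings` maps the CUDA C APIs almost 1:1, which suits code that needs fine-grained control. Its main modules are driver (driver API), runtime (runtime API), and nvrtc (runtime compilation). Unlike the C functions, each binding returns a tuple: the status code first, followed by any output values.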
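Because every `cuda.bindings` call returns a tuple that starts with a status code, a small unpacking helper keeps call sites short. The sketch below shows the pattern without touching a GPU: `FakeResult` is a hypothetical stand-in for a real status enum such as `nvrtc.nvrtcResult`, and `check` is an illustrative name rather than part of the library.

```python
from enum import IntEnum


class FakeResult(IntEnum):
    """Hypothetical stand-in for a CUDA status enum (0 means success)."""
    SUCCESS = 0
    ERROR_INVALID_INPUT = 3


def check(result):
    """Raise on a non-zero status; otherwise return the remaining values."""
    status, *values = result
    if int(status) != 0:
        raise RuntimeError(f"CUDA call failed: {status!r}")
    if not values:
        return None
    return values[0] if len(values) == 1 else tuple(values)


# A call without outputs comes back as a one-element tuple: (status,)
no_output = check((FakeResult.SUCCESS,))
# A call with one output comes back as (status, value)
one_output = check((FakeResult.SUCCESS, 42))
```

With the real bindings the same helper would wrap each call, for example `prog = check(nvrtc.nvrtcCreateProgram(...))`.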

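(1) `nvrtc` module: compiling to PTX at run time

This module compiles CUDA C source code at run time into PTX (Parallel Thread Execution) intermediate code.

Common APIs:

nvrtcCreateProgram

nvrtcCompileProgram

nvrtcGetPTX

nvrtcGetProgramLog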

Example: compile a kernel and retrieve its PTX

    from cuda.bindings import nvrtc

    # CUDA C source for the SAXPY kernel: y = a * x + y
    kernel_code = """
    extern "C" __global__ void saxpy(float a, float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    """

    # 1. Create the program object. The binding returns (status, program);
    #    source and file name are passed as bytes.
    err, prog = nvrtc.nvrtcCreateProgram(kernel_code.encode(), b"saxpy.cu", 0, [], [])
    assert err == nvrtc.nvrtcResult.NVRTC_SUCCESS, "failed to create program"

    # 2. Compile for a target architecture (compute_70 here).
    compile_opts = [b"--gpu-architecture=compute_70"]
    err, = nvrtc.nvrtcCompileProgram(prog, len(compile_opts), compile_opts)
    if err != nvrtc.nvrtcResult.NVRTC_SUCCESS:
        # Read the compiler log into a preallocated bytes buffer.
        _, log_size = nvrtc.nvrtcGetProgramLogSize(prog)
        log = b" " * log_size
        nvrtc.nvrtcGetProgramLog(prog, log)
        raise RuntimeError(f"compilation failed: {log.decode()}")

    # 3. Retrieve the PTX, again through a preallocated buffer.
    err, ptx_size = nvrtc.nvrtcGetPTXSize(prog)
    ptx = b" " * ptx_size
    err, = nvrtc.nvrtcGetPTX(prog, ptx)
    assert err == nvrtc.nvrtcResult.NVRTC_SUCCESS, "failed to get PTX"
    print("PTX:\n", ptx.decode()[:500])  # print the first 500 characters
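The example hard-codes `compute_70` as the target, which only suits some GPUs and toolkits. As a minimal sketch, the option can instead be built from a device's compute capability. The helper name `arch_option` is hypothetical, and the major/minor values are assumed to come from a device query made elsewhere.

```python
def arch_option(major: int, minor: int, virtual: bool = True) -> bytes:
    """Build an NVRTC --gpu-architecture option from a compute capability.

    virtual=True targets a virtual architecture (compute_XY), which yields PTX;
    virtual=False targets a real architecture (sm_XY).
    """
    prefix = "compute" if virtual else "sm"
    return f"--gpu-architecture={prefix}_{major}{minor}".encode()


# Compute capability 7.0 reproduces the option used in the example
volta_ptx = arch_option(7, 0)
# Compute capability 8.6, targeting the real architecture
ampere_sm = arch_option(8, 6, virtual=False)
```

The result is already `bytes`, so it can be placed directly in the options list passed to `nvrtcCompileProgram`.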

(2) `driver` module: device management and kernel execution

This module manages GPU devices, allocates memory, loads PTX, and launches kernels.

Common APIs:

cuInit

cuDeviceGet

cuCtxCreate

cuMemAlloc

cuMemFree

cuMemcpyHtoD

cuMemcpyDtoH

cuModuleLoadData

cuModuleGetFunction

cuLaunchKernel

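Example: run the SAXPY kernel

    from cuda.bindings import driver
    import numpy as np

    def check(result):
        """Unpack a binding's (status, *values) tuple, raising on failure."""
        err, *values = result
        if err != driver.CUresult.CUDA_SUCCESS:
            raise RuntimeError(f"CUDA driver error: {err}")
        return values[0] if values else None

    # Initialize the driver, pick GPU 0, and create a context.
    check(driver.cuInit(0))
    device = check(driver.cuDeviceGet(0))
    # Signature of cuda-bindings 12.x; 13.x is assumed to take an extra
    # leading parameter: cuCtxCreate(None, 0, device).
    ctx = check(driver.cuCtxCreate(0, device))

    # Host data. Scalars are kept in one-element arrays so they have an address.
    n = 1024
    a = np.array([2.0], dtype=np.float32)
    n_arg = np.array([n], dtype=np.int32)
    x = np.random.rand(n).astype(np.float32)
    y = np.random.rand(n).astype(np.float32)
    expected = a[0] * x + y

    # Allocate device memory and copy host data to the device.
    x_dev = check(driver.cuMemAlloc(x.nbytes))
    y_dev = check(driver.cuMemAlloc(y.nbytes))
    check(driver.cuMemcpyHtoD(x_dev, x.ctypes.data, x.nbytes))
    check(driver.cuMemcpyHtoD(y_dev, y.ctypes.data, y.nbytes))

    # Load the PTX bytes produced in the previous step and look up the kernel.
    ptx_image = np.char.array(ptx)
    module = check(driver.cuModuleLoadData(ptx_image.ctypes.data))
    kernel = check(driver.cuModuleGetFunction(module, b"saxpy"))

    # Kernel arguments (a, x, y, n): an array holding the address of each value.
    x_arg = np.array([int(x_dev)], dtype=np.uint64)
    y_arg = np.array([int(y_dev)], dtype=np.uint64)
    args = np.array([v.ctypes.data for v in (a, x_arg, y_arg, n_arg)], dtype=np.uint64)

    # 256 threads per block, and enough blocks to cover all n elements.
    block_dim = (256, 1, 1)
    grid_dim = ((n + block_dim[0] - 1) // block_dim[0], 1, 1)

    check(driver.cuLaunchKernel(
        kernel,
        *grid_dim,         # grid size
        *block_dim,        # block size
        0,                 # dynamic shared memory in bytes
        0,                 # stream (0 = default stream)
        args.ctypes.data,  # kernel parameters
        0,                 # extra options
    ))
    check(driver.cuCtxSynchronize())  # wait for the kernel to finish

    # Copy the result back to the host and verify it.
    y_result = np.empty_like(y)
    check(driver.cuMemcpyDtoH(y_result.ctypes.data, y_dev, y.nbytes))
    assert np.allclose(y_result, expected, atol=1e-5), "result mismatch"
    print("SAXPY succeeded")

    # Release resources.
    check(driver.cuMemFree(x_dev))
    check(driver.cuMemFree(y_dev))
    check(driver.cuModuleUnload(module))
    check(driver.cuCtxDestroy(ctx))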
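In the C driver API, `cuLaunchKernel` takes its kernel arguments as `void **kernelParams`: an array with one pointer per kernel parameter, each pointing at the storage that holds the value. The sketch below builds that layout with `ctypes` alone so it can be inspected without a GPU. The device addresses are made-up placeholders, `pack_kernel_args` is a hypothetical helper, and nothing here is passed to a real launch.

```python
import ctypes


def pack_kernel_args(*values):
    """Return (keepalive, params); params[i] is the address of values[i]."""
    params = (ctypes.c_void_p * len(values))()
    for i, value in enumerate(values):
        params[i] = ctypes.addressof(value)
    # The caller must keep `values` alive for as long as `params` is in use.
    return values, params


# Arguments matching saxpy(float a, float *x, float *y, int n)
a = ctypes.c_float(2.0)
x_ptr = ctypes.c_void_p(0x7F0000000000)  # placeholder device address
y_ptr = ctypes.c_void_p(0x7F0000001000)  # placeholder device address
n = ctypes.c_int(1024)
keepalive, params = pack_kernel_args(a, x_ptr, y_ptr, n)

# Follow the pointers back to confirm the layout
a_back = ctypes.c_float.from_address(params[0]).value
x_back = ctypes.c_void_p.from_address(params[1]).value
n_back = ctypes.c_int.from_address(params[3]).value
```

Note that a pointer parameter needs two levels: the array entry points at a slot, and the slot holds the device address.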

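2. `cuda.core`: high-level Pythonic interface

`cuda.core` offers a more concise API that wraps device management, memory operations, and kernel compilation, which suits rapid development.

Common classes:

Device (for example, Device(0) selects GPU 0)

Stream

Event

Program

LaunchConfig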

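Example: SAXPY with cuda.core. This version assumes a 0.x release of cuda.core, where the classes live under cuda.core.experimental and launch takes the stream first (newer releases may expose the same names from cuda.core directly), and it assumes CuPy is installed to hold the device arrays.

    import cupy as cp
    import numpy as np
    from cuda.core.experimental import Device, LaunchConfig, Program, ProgramOptions, launch

    # Select GPU 0, make it current, and create a stream.
    dev = Device(0)
    dev.set_current()
    stream = dev.create_stream()

    # Prepare the data as CuPy device arrays.
    n = 1024
    a = 2.0
    x = cp.random.rand(n, dtype=cp.float32)
    y = cp.random.rand(n, dtype=cp.float32)
    expected = a * x + y
    dev.sync()  # CuPy uses its own stream; finish its work before launching

    # Compile the kernel (cuda.core drives NVRTC internally).
    kernel_code = """
    extern "C" __global__ void saxpy(float a, float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    """
    arch = "sm_" + "".join(str(v) for v in dev.compute_capability)
    program = Program(kernel_code, code_type="c++", options=ProgramOptions(arch=arch))
    module = program.compile("cubin")
    kernel = module.get_kernel("saxpy")

    # Launch configuration: 256 threads per block, enough blocks to cover n.
    block = 256
    config = LaunchConfig(grid=(n + block - 1) // block, block=block)

    # Launch. Scalars are passed as NumPy types that match the C signature.
    launch(stream, config, kernel, np.float32(a), x.data.ptr, y.data.ptr, np.int32(n))
    stream.sync()  # wait for completion

    # Verify the result (y was updated in place).
    assert cp.allclose(y, expected, atol=1e-5)
    print("cuda.core SAXPY succeeded")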
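Both SAXPY examples size the grid with a ceiling division and rely on the kernel's `if (i < n)` guard. The following pure-Python emulation, a sketch rather than anything that runs on a GPU, walks the same block and thread indices sequentially to show why both pieces are needed when `n` is not a multiple of the block size. The function names are illustrative.

```python
def grid_size(n: int, block: int) -> int:
    """Smallest number of blocks whose threads cover n elements."""
    return (n + block - 1) // block


def saxpy_emulated(a, x, y, block=256):
    """Run SAXPY thread by thread on the CPU, counting guarded-out threads."""
    n = len(x)
    out = list(y)
    skipped = 0
    for block_idx in range(grid_size(n, block)):
        for thread_idx in range(block):
            i = block_idx * block + thread_idx  # global thread index
            if i < n:                           # same guard as the kernel
                out[i] = a * x[i] + out[i]
            else:
                skipped += 1                    # thread past the end of the data
    return out, skipped


# 1000 elements need 4 blocks of 256 threads (1024 in total),
# so 24 threads fall past the end and must do nothing.
out, skipped = saxpy_emulated(2.0, [1.0] * 1000, [0.5] * 1000)
```

Without the guard, those surplus threads would read and write beyond the arrays.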

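3. `cuda.pathfinder`: locating CUDA components

`cuda.pathfinder` finds CUDA dynamic libraries, header files, and other components on the system, which helps when deploying across different environments.

Common APIs (names as found in recent cuda-pathfinder releases; check them against the installed version):

load_nvidia_dynamic_lib: locates and loads a dynamic library by its short name, such as "cudart"

find_nvidia_header_directory: returns the header directory for a given library name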

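Example:

    from cuda.pathfinder import find_nvidia_header_directory, load_nvidia_dynamic_lib

    # Locate (and load) the CUDA runtime library.
    cudart = load_nvidia_dynamic_lib("cudart")
    print(f"CUDA runtime library path: {cudart.abs_path}")

    # Locate the directory that holds the CUDA runtime headers.
    include_dir = find_nvidia_header_directory("cudart")
    print(f"CUDA header directory: {include_dir}")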
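To make the idea of component location concrete, the sketch below searches an ordered list of candidate directories for a library file. It is an illustration only and not how `cuda.pathfinder` is implemented: the environment variable names are common conventions assumed here, and `candidate_dirs` and `find_first` are hypothetical helpers.

```python
import os
import tempfile
from pathlib import Path


def candidate_dirs(env=os.environ):
    """Directories to search, in priority order (assumed conventions)."""
    dirs = []
    for var in ("CONDA_PREFIX", "CUDA_HOME", "CUDA_PATH"):
        root = env.get(var)
        if root:
            dirs += [Path(root) / "lib", Path(root) / "lib64"]
    return dirs


def find_first(filename, search_dirs):
    """Return the first existing path for filename, or None."""
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return str(candidate)
    return None


# Demonstrate against a throwaway directory tree instead of a real install
with tempfile.TemporaryDirectory() as root:
    lib_dir = Path(root) / "lib64"
    lib_dir.mkdir()
    (lib_dir / "libcudart.so.12").touch()
    found = find_first("libcudart.so.12", candidate_dirs({"CUDA_HOME": root}))
```

A real locator also has to handle pip-installed wheels, platform-specific file names, and version suffixes, which is the work the library takes over.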

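IV. Advanced Scenario: Parallel Algorithms with `cuda.cccl.parallel`

cuda.cccl.parallel wraps the efficient parallel algorithms of the CCCL library (such as sort and reduce). They are called from host code and operate on device arrays.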

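Example: parallel reduction (sum). This version assumes a cuda-cccl release that exposes the algorithms under cuda.cccl.parallel.experimental with a single-call reduce_into, and it uses CuPy for the device arrays.

    import cupy as cp
    import numpy as np
    import cuda.cccl.parallel.experimental as parallel

    def add(a, b):
        """Binary reduction operator."""
        return a + b

    # Generate random data on the device.
    d_input = cp.random.rand(1024 * 1024, dtype=cp.float32)
    d_output = cp.empty(1, dtype=cp.float32)
    h_init = np.array([0.0], dtype=np.float32)  # initial value 0.0
    expected_sum = float(d_input.sum())

    # Sum in parallel with the CCCL reduction.
    parallel.reduce_into(d_input, d_output, add, d_input.size, h_init)
    result = float(d_output[0])

    assert np.isclose(result, expected_sum, rtol=1e-4), "reduction result mismatch"
    print(f"Parallel sum: {result} (expected: {expected_sum})")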
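A parallel reduction combines elements pairwise in rounds rather than left to right, which is why the operator should be associative. The pure-Python sketch below mimics that tree shape on the CPU. It illustrates the idea only and makes no claim about how CCCL schedules the work; `tree_reduce` is a hypothetical name.

```python
def tree_reduce(values, init, op):
    """Reduce values with op by combining neighbours level by level."""
    level = list(values)
    if not level:
        return init
    while len(level) > 1:
        # Each round halves the number of partial results; on a GPU the
        # pairs within one round could be combined concurrently.
        paired = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd element carries over unchanged
        level = paired
    return op(init, level[0])


total = tree_reduce(range(1, 101), 0, lambda a, b: a + b)
largest = tree_reduce([3, 9, 2, 7], float("-inf"), max)
```

With floating-point addition the grouping changes the rounding slightly, so a GPU sum is usually compared against a reference with a tolerance rather than for exact equality.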

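V. Summary

CUDA Python provides a full range of interfaces, from low level to high level:

cuda.bindings: low-level bindings for fine-grained control

cuda.core: high-level Pythonic interface for rapid development

cuda.cccl.parallel: host-callable parallel algorithms from CCCL

cuda.pathfinder: locating CUDA libraries and headers

For more details, see the official documentation:

cuda.core documentation

cuda.bindings documentation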
