Nvidia Developer 09月21日
RAPIDS 25.08版本更新:提升数据科学加速与易用性
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

NVIDIA RAPIDS 25.08版本发布,带来了多项重要更新,旨在进一步提升加速数据科学的可访问性和可扩展性。新版本引入了两个用于cuml.accel代码的全新性能剖析工具,帮助用户识别GPU与CPU的运算分配及耗时,从而优化机器学习工作流。Polars GPU引擎现已支持更大的数据集处理,默认启用的流式执行器能有效处理超出GPU显存的数据。此外,Polars GPU引擎在结构体数据和字符串操作方面也得到了增强。cuML新增了Spectral Embedding、LinearSVC、LinearSVR和KernelRidge等算法支持。值得注意的是,此版本起不再支持CUDA 11。

✨ **性能剖析工具增强**:RAPIDS 25.08版本为cuml.accel引入了函数级和行级性能剖析工具。这些工具能够清晰展示代码在GPU和CPU上的执行情况、调用次数及耗时,帮助开发者快速定位性能瓶颈,优化机器学习模型的运行效率,尤其是在处理混合加速和CPU回退的场景时。

🚀 **Polars GPU引擎扩展**:Polars GPU引擎在25.08版本中得到显著增强,其默认启用的流式执行器能够高效处理远超GPU显存容量的数据集,大幅提升了大规模数据处理的性能和可扩展性。同时,对结构体数据类型和更多字符串操作的支持,进一步丰富了其在复杂数据处理场景下的应用能力。

💡 **cuML算法库扩充**:cuML在本次更新中新增了Spectral Embedding、LinearSVC、LinearSVR和KernelRidge等算法。Spectral Embedding为降维和流形学习提供了新选择,而LinearSVC、LinearSVR和KernelRidge的加入则意味着更多线性模型和回归算法可以在cuML中实现零代码改动加速,降低了用户使用门槛。

⚠️ **CUDA 11支持终止**:自RAPIDS 25.08版本起,官方将停止对CUDA 11的支持,包括相关的容器、已发布包以及源码构建。用户若需继续使用CUDA 11,可选择锁定在RAPIDS 25.06版本。

The 25.08 release of RAPIDS continues to push the boundaries toward making accelerated data science more accessible and scalable with the addition of several new features, including:

    Two new profiling tools for troubleshooting cuml.accel codeSupport for larger, more complex data in the Polars GPU engineNew algorithm support in cuML and cuml.accelAn update on CUDA version support

Learn more about the new features below.

The 25.08 release brings the addition of two new profiling options to cuml.accel. Similar to the profiler that was previously released for cudf.pandas, these new profiling features help users understand which operations are accelerated by cuML on the GPU, which fall back to running on the CPU, and how long these operations take. This can be useful for users trying to understand current performance bottlenecks in their machine learning workflows. 

First, we introduced a function-level profiler. This profiler shows users all of the operations in a given script or cell that were run on the GPU vs. CPU. It also shows the amount of time each function took on each. 

There are two ways to use the function-level profiler. If running a Jupyter or IPython notebook, users can call %%cuml.accel.profile after cuml.accel has already been loaded and profile an entire cell:

%%cuml.accel.profilefrom sklearn.linear_model import Ridgefrom sklearn.datasets import make_regressionX, y = make_regression(n_samples=100)# Fit and predict on GPUridge = Ridge(alpha=1.0)ridge.fit(X, y)ridge.predict(X)# Retry, using an unsupported hyperparameterridge = Ridge(positive=True)ridge.fit(X, y)ridge.predict(X)

The output of this cell contains the profile result:

cuml.accel profile                                             ┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓┃ Function      ┃ GPU calls ┃ GPU time ┃ CPU calls ┃ CPU time ┃┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩│ Ridge.fit     │         1 │  141.2ms │         1 │      3ms ││ Ridge.predict │         1 │   31.5ms │         1 │   97.3µs │├───────────────┼───────────┼──────────┼───────────┼──────────┤│ Total         │         2 │  172.7ms │         2 │    3.1ms │└───────────────┴───────────┴──────────┴───────────┴──────────┘Not all operations ran on the GPU. The following functions required CPU fallback for the following reasons:* Ridge.fit  - `positive=True` is not supported* Ridge.predict  - Estimator not fit on GPU

The function-level profiler can also be called on a Python script using the --profile flag from the CLI:

python -m cuml.accel --profile script.py

The second profiler is a line-level profiler, showing users where each piece of code is executed line-by-line. Like the function-level profiler, the line-level profiler can be called within a notebook with %%cuml.accel.line_profile.

%%cuml.accel.line_profilefrom sklearn.linear_model import Ridgefrom sklearn.datasets import make_regressionX, y = make_regression(n_samples=100)# Fit and predict on GPUridge = Ridge(alpha=1.0)ridge.fit(X, y)ridge.predict(X)# Retry, using an unsupported hyperparameterridge = Ridge(positive=True)ridge.fit(X, y)ridge.predict(X)
cuml.accel line profile                                                    ┏━━━━┳━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓┃  # ┃ N ┃    Time ┃ GPU % ┃ Source                                       ┃┡━━━━╇━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩│  1 │ 1 │       - │     - │ from sklearn.linear_model import Ridge       ││  2 │ 1 │       - │     - │ from sklearn.datasets import make_regression ││  3 │   │         │       │                                              ││  4 │   │         │       │                                              ││  5 │ 1 │   1.1ms │     - │ X, y = make_regression(n_samples=100)        ││  6 │   │         │       │                                              ││  7 │   │         │       │                                              ││  8 │   │         │       │ # Fit and predict on GPU                     ││  9 │ 1 │       - │     - │ ridge = Ridge(alpha=1.0)                     ││ 10 │ 1 │ 174.2ms │  99.0 │ ridge.fit(X, y)                              ││ 11 │ 1 │   5.2ms │  99.0 │ ridge.predict(X)                             ││ 12 │   │         │       │                                              ││ 13 │   │         │       │                                              ││ 14 │   │         │       │ # Retry, using an unsupported hyperparameter ││ 15 │ 1 │       - │     - │ ridge = Ridge(positive=True)                 ││ 16 │ 1 │   4.5ms │   0.0 │ ridge.fit(X, y)                              ││ 17 │ 1 │ 172.7µs │   0.0 │ ridge.predict(X)                             ││ 18 │   │         │       │                                              │└────┴───┴─────────┴───────┴──────────────────────────────────────────────┘Ran in 185.6ms, 96.4% on GPU

The line profiler can also be called via the --line-profile flag from the command line:

python -m cuml.accel --line-profile script.py

With these new profiling capabilities, cuml.accel provides users with more tools to make accelerating and debugging machine learning code even easier. 

Process larger, more complex data with the Polars GPU engine powered by NVIDIA cuDF

Work with datasets larger than GPU memory with the new default streaming executor

The streaming execution mode that was introduced as an experimental feature in 25.06 is now the default in the Polars GPU engine. This new executor takes advantage of data partitioning to allow datasets much larger than VRAM (GPU memory) to be processed efficiently. The streaming executor can still fall back to in-memory execution for any unsupported operations, but as of the 25.08 release, streaming execution supports almost all of the operators that are supported for in-memory GPU execution. This unlocks substantial performance and scalability improvements.

For smaller datasets, using the streaming execution mode on a single GPU incurs a very minor performance overhead when compared to using the in-memory engine. However, as dataset size grows and begins to exceed GPU memory, the streaming executor provides massive speedups compared to the in-memory engine.

Figure 1. Performance comparison of the Polars GPU engine’s in-memory and streaming execution modes. The streaming engine is nearly 5x faster on the 300GB larger-than-memory workload. 

For more information on the Polars GPU streaming executor, visit our documentation.

Keep complex data like structs and string operations on the GPU

The Polars GPU engine now supports struct data in columns. Previously, any operations involving structs would fall back to CPU execution, but with the latest release all of these operations are now GPU-accelerated for improved performance:

>>> import polars as pl... ratings = pl.LazyFrame(...     {...         "Movie": ["Cars", "IT", "ET", "Cars", "Up", "IT", "Cars", "ET", "Up", "ET"],...         "Theatre": ["NE", "ME", "IL", "ND", "NE", "SD", "NE", "IL", "IL", "SD"],...         "Avg_Rating": [4.5, 4.4, 4.6, 4.3, 4.8, 4.7, 4.7, 4.9, 4.7, 4.6],...         "Count": [30, 27, 26, 29, 31, 28, 28, 26, 33, 26],...     }... )... ratings.select(pl.col("Theatre").value_counts()).collect(engine=pl.GPUEngine(raise_on_fail=True))...shape: (5, 1)┌───────────┐│ Theatre     ││ ---         ││ struct[2]   │╞═════════==╡│ {"NE",3}    ││ {"ND",1}    ││ {"ME",1}    ││ {"SD",2}    ││ {"IL",3}    │└───────────┘

Additionally, the Polars GPU engine now supports a substantially expanded set of string operators, for example:

>>> ldf = pl.LazyFrame({"foo": [1, None, 2]})>>> ldf.select(pl.col("foo").str.join("-")).collect(engine=gpu_engine)shape: (1, 1)┌─────┐│ foo  ││ ---  ││ str  │╞═════╡│ 1-2  │└─────┘>>> ldf = pl.LazyFrame({...     "lines": [...         "I Like\nThose\nOdds",...         "This is\nThe Way",...     ]... })... ldf.with_columns(...     pl.col("lines").str.extract(r"(T\w+)", 1).alias("matches"),... ).collect(engine=pl.GPUEngine(raise_on_fail=True))...shape: (2, 2)┌─────────┬──────┐│ lines   ┆ matches ││ ---     ┆ ---     ││ str     ┆ str     │╞═════════╪══════╡│ I Like  ┆ Those   ││ Those   ┆         ││ Odds    ┆         ││ This is ┆ This    ││ The Way ┆         │└─────────┴──────┘

The expansion of datatype support further strengthens the capabilities of the Polars GPU engine, accelerating the delivery of the most common end user functionality. 

New algorithms supported in cuML: Spectral Embedding, LinearSVC, LinearSVR, and KernelRidge

With the 25.08 release, cuML has added a Spectral Embedding algorithm for dimensionality reduction and manifold learning. Spectral Embedding is an approach which uses the eigenvalues and eigenvectors of a similarity graph to embed high-dimensional data into a lower-dimensional space. 

The API for the new Spectral Embedding algorithm in cuML matches that of the Spectral Embedding implementation in scikit-learn:

from cuml.manifold import SpectralEmbeddingimport cupy as cpfrom sklearn.datasets import fetch_openml# (70000, 784) -> (70000, 2)mnist = fetch_openml('mnist_784', version=1)X, y = mnist.data, mnist.target.astype(int)spectral = SpectralEmbedding(n_components=2, n_neighbors=None, random_state=42)embedding = spectral.fit_transform(cp.asarray(X, order='C', dtype=cp.float32))

Additionally, cuml.accel now accelerates several new algorithms with zero code changes. LinearSVC and LinearSVR estimators were added in the 25.08 release, which means that all estimators within the support vector machine family are now a part of cuml.accel. 

KernelRidge was also added to cuml.accel, bringing another popular regression algorithm under the zero code change umbrella.

For more information on the algorithms that are supported today, check out our full documentation

Deprecation of CUDA 11 support

Beginning with the 25.08 release, we dropped support for CUDA 11, which includes all containers, published packages, and the ability to build from source. Users who want to continue running CUDA 11 may pin to RAPIDS version 25.06. 

Visit the RAPIDS documentation to learn more. 

Conclusion

The NVIDIA RAPIDS 25.08 release offers a substantial leap forward in accelerating and optimizing data science workflows. With the introduction of the cuml.accel profiler, developers now have powerful tools to diagnose and improve the performance of their machine learning code. Updates to the Polars GPU engine like the streaming executor and expanded data type support enable efficient processing of large datasets, enhancing scalability and performance. Additionally, the inclusion of new algorithms in cuML further streamlines the machine learning ecosystem. These developments collectively contribute to making accelerated data science more accessible and efficient for users. For a deeper dive into all the new features and improvements, be sure to visit the RAPIDS documentation.

We welcome your feedback on GitHub. You can also join the 3,500+ members of the RAPIDS Slack community to talk GPU-accelerated data processing.

If you’re new to RAPIDS, check out these resources to get started and take our Accelerate Data Science Workflows with Zero Code Changes course for free.  To learn more about accelerated data science, explore our DLI Learning Path and enroll in a hands-on course, such as Best Practices in Feature Engineering for Tabular Data With GPU Acceleration.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

RAPIDS GPU加速 数据科学 机器学习 性能优化 Polars cuML cuml.accel CUDA
相关文章