Nvidia Developer · October 6, 21:16
GPU-Accelerated Databases and Query Engines Boost Data Processing Efficiency

As data processing demands grow, GPU-accelerated databases and query engines are showing significant performance advantages over CPU-based systems. The high memory bandwidth and multithreaded processing of GPUs are especially well suited to compute-intensive tasks such as joins, aggregations, and string processing. Through a collaboration between IBM and NVIDIA, NVIDIA cuDF has been integrated into the Velox execution engine, enabling GPU-native query execution for platforms such as Presto and Apache Spark. The project aims to deliver end-to-end GPU acceleration for Presto by optimizing operators such as TableScan, HashJoin, and HashAggregation, and explores multi-GPU execution as well as hybrid CPU-GPU execution for Spark, meeting the challenges of massive-scale data processing and giving data and business analysts real-time insights.

🚀 GPUs significantly boost data processing performance, particularly for compute-intensive workloads such as joins, aggregations, and string operations, where their high memory bandwidth and thread counts pay off, giving data and business analysts faster real-time insight.

🤝 The IBM-NVIDIA collaboration is the key driver: by integrating NVIDIA cuDF into the Velox execution engine, it enables native GPU query execution for widely used platforms such as Presto and Apache Spark, a major technical advance for the data processing ecosystem.

✨ The Velox execution engine acts as an intermediate layer, translating Presto and Spark query plans into executable GPU pipelines powered by cuDF. Optimizations to operators such as TableScan, HashJoin, and HashAggregation enable end-to-end GPU acceleration for Presto, including multi-GPU execution and a hybrid CPU-GPU execution mode for Spark, significantly reducing query response times.

📊 Benchmark results show that GPU-accelerated Presto delivers severalfold performance gains over the CPU version on large datasets. For example, in the TPC-H benchmark, Presto on NVIDIA GPUs ran far faster than the CPU version, with even larger gains on multi-GPU nodes connected by high-bandwidth NVLink.

As workloads scale and demand for faster data processing grows, GPU-accelerated databases and query engines have been shown to deliver significant price-performance gains compared to CPU-based systems. The high memory bandwidth and thread count of GPUs especially benefit compute-heavy workloads like multiple joins, complex aggregations, string processing, and more. The growing availability of GPU nodes and the broad feature coverage of GPU algorithms make GPU data processing more accessible than ever before.

By addressing performance bottlenecks, both data and business analysts can now query massive datasets to generate real-time insights and explore analytics scenarios.
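To give a concrete feel for these operations, here is a minimal sketch using cuDF's Python API on made-up data (the integration described in this post is the C++ cuDF backend inside Velox, not this API): a GPU hash join, a GPU hash aggregation, and a GPU string operation.

```python
import cudf

# Illustrative tables; all columns live in GPU memory.
orders = cudf.DataFrame({
    "order_id": [1, 2, 3, 4],
    "cust_id": [10, 10, 20, 30],
    "amount": [5.0, 7.5, 3.2, 9.9],
})
customers = cudf.DataFrame({
    "cust_id": [10, 20, 30],
    "name": ["alice", "bob", "carol"],
})

# Hash join executed on the GPU.
joined = orders.merge(customers, on="cust_id", how="inner")

# Hash aggregation executed on the GPU.
totals = joined.groupby("name").agg({"amount": "sum"})

# String processing executed on the GPU.
joined["name_upper"] = joined["name"].str.upper()

print(totals)
```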

To support the increasing demand, IBM and NVIDIA are working together to bring NVIDIA cuDF to the Velox execution engine, enabling GPU-native query execution for widely used platforms like Presto and Apache Spark. This is an open project. 

How Velox and cuDF work together to translate query plans

Velox acts as an intermediate layer, translating query plans from systems like Presto and Spark into executable GPU pipelines powered by cuDF, as shown in Figure 1. For more details, see Extending Velox – GPU Acceleration with cuDF.

In this post, we’re excited to share initial performance results of Presto and Spark using the GPU backend in Velox. We dive into:

- End-to-end Presto acceleration
- Scaling up Presto to support multi-GPU execution
- Demonstrating hybrid CPU-GPU execution in Apache Spark

Figure 1. A query flows from Presto or Apache Spark through the Velox engine, where it is converted into executable GPU pipelines powered by cuDF

Moving the entire Presto query plan to GPU for faster execution

The first step of query processing is to translate incoming SQL commands into query plans with tasks for each node in the cluster. On each worker node, the cuDF backend for Velox receives a plan from the Presto coordinator, rewrites the plan using GPU operators, and then executes the plan. 
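The plan-rewriting step can be pictured roughly as a tree walk that swaps CPU operators for GPU equivalents. The sketch below is purely illustrative Python with hypothetical names; the real cuDF backend rewrites Velox plan nodes in C++.

```python
# Hypothetical operator mapping, for illustration only; these are not
# Velox or cuDF class names.
CPU_TO_GPU = {
    "TableScan": "CudfTableScan",
    "HashJoin": "CudfHashJoin",
    "HashAggregation": "CudfHashAggregation",
    "FilterProject": "CudfFilterProject",
}

def rewrite_plan(node):
    """Recursively replace CPU operators with GPU equivalents where available."""
    node["operator"] = CPU_TO_GPU.get(node["operator"], node["operator"])
    for child in node.get("children", []):
        rewrite_plan(child)
    return node

# A toy plan shaped like the ones discussed in this post.
plan = {
    "operator": "HashAggregation",
    "children": [{
        "operator": "HashJoin",
        "children": [
            {"operator": "TableScan", "children": []},
            {"operator": "TableScan", "children": []},
        ],
    }],
}
rewrite_plan(plan)  # every operator in the tree now names a GPU implementation
```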

Running Presto plans using Velox with cuDF required improvements to the GPU operators for TableScan, HashJoin, HashAggregation, FilterProject, and more. 

- TableScan: The Velox TableScan was extended on CPU to be compatible with GPU I/O, decompression, and decoding components in cuDF.
- HashJoin: The available join types were expanded to include left, right, and inner, as well as support for filters and null semantics.
- HashAggregation: A streaming interface was introduced to manage partial and final aggregations (sketched below).
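As a rough illustration of the partial/final aggregation pattern behind the streaming interface, the two phases can be expressed with cuDF's Python API on toy chunks (the actual Velox cuDF backend implements this in C++ and differs in detail):

```python
import cudf

# Two incoming chunks of a larger stream.
chunks = [
    cudf.DataFrame({"key": ["a", "b", "a"], "val": [1, 2, 3]}),
    cudf.DataFrame({"key": ["b", "c"], "val": [4, 5]}),
]

# Partial aggregation: reduce each incoming chunk independently.
partials = [c.groupby("key").agg({"val": "sum"}) for c in chunks]

# Final aggregation: combine the partial results into one output.
final = cudf.concat(partials).reset_index().groupby("key").agg({"val": "sum"})
print(final)  # a=4, b=6, c=5
```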

Overall, the operator expansion in the cuDF backend for Velox enables end-to-end GPU execution in Presto, making full use of the Presto SQL parser, optimizer, and coordinator.

The team collected query runtime data from the Presto tpch benchmark (derived from TPC-H), using Parquet data sources with both the Presto C++ and Presto-on-GPU worker types. Note that Presto C++ was not able to complete Q21 with standard configuration options, so the figure reports the total runtime for the 21 successful queries.

As shown in Figure 2, at scale factor 1,000, we observed runtimes of 1,246 seconds for Presto C++ on an AMD 5965X, 133.8 seconds for Presto on an NVIDIA RTX PRO 6000 Blackwell Workstation, and 99.9 seconds for Presto on an NVIDIA GH200 Grace Hopper Superchip. We also used CUDA managed memory to complete Q21 on GH200 (see the Figure 2 asterisk), yielding a 148.9-second runtime for Presto GPU on the full query set.

Figure 2. Runtime results for 21 of 22 queries defined in Presto tpch, executed with single-node Presto C++ on CPU and Presto on NVIDIA GPUs at scale factor 1,000
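How Presto-on-GPU enables managed memory is specific to that deployment, but the concept can be sketched with RMM's Python API: route GPU allocations through CUDA managed (unified) memory, so working sets larger than device memory can migrate between host and device on demand.

```python
import rmm
import cudf

# Use CUDA managed (unified) memory for all subsequent GPU allocations,
# allowing data to spill between device and host memory transparently.
rmm.reinitialize(managed_memory=True)

df = cudf.DataFrame({"x": range(10)})  # allocated in managed memory
print(df["x"].sum())
```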

Multi-GPU Presto for faster data exchange and lower query runtime

In distributed query execution, Exchange is a critical operator that controls data movement between workers on the same node as well as between nodes. GPU-accelerated Presto uses a UCX-based Exchange operator that supports running the entire execution pipeline on GPU. UCX (Unified Communication X) is an open source communication library designed to achieve the highest performance for HPC applications; its core leverages high-bandwidth NVLink for intra-node connectivity and RoCE or InfiniBand for inter-node connectivity.

Velox supports several Exchange types for different types of data movements: Partitioned, Merge, and Broadcast. Partitioned Exchange uses a hash function to partition input data and then sends the partitions to other workers for further processing. Merge Exchange receives multiple input partitions from other workers and then produces a single, sorted output partition. Broadcast Exchange loads the data in one worker and then copies the data to all other workers. Integration of GPU exchange into the cuDF backend for Velox is in progress, and the implementation is available on mainline Velox.
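The Partitioned Exchange idea can be illustrated with cuDF's Python API: hash-partition a DataFrame so each partition can be shipped to a different worker. This sketch shows only the partitioning step; the actual Velox exchange moves these buffers over UCX, using NVLink, RoCE, or InfiniBand.

```python
import cudf

df = cudf.DataFrame({
    "key": [1, 2, 3, 4, 5, 6],
    "val": [10, 20, 30, 40, 50, 60],
})

# Split rows into 3 partitions by hashing the "key" column; in a distributed
# exchange, partition i would be sent to worker i for further processing.
partitions = df.partition_by_hash(["key"], nparts=3)
for i, part in enumerate(partitions):
    print(f"worker {i} receives {len(part)} rows")
```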

As shown in Figure 3, Presto achieves efficient performance on GPU with the new UCX-based Exchange, especially when high-bandwidth intra-node connectivity is provisioned between GPUs. Results were collected for Presto on GPU with both the baseline HTTP Exchange method and the UCX-based cuDF Exchange method. An eight-GPU NVIDIA DGX A100 node delivered a >6x speedup when using NVLink in the Exchange operator compared to the Presto baseline HTTP Exchange. With eight GPU workers, Presto can finish all 22 queries with the default async memory allocation, without using managed memory.

Figure 3. Runtime results for the 22 queries defined in Presto tpch benchmark, executed with Presto GPU on NVIDIA DGX A100 (eight A100 GPUs) at scale factor 1,000 

Hybrid CPU-GPU execution in Apache Spark

While the Presto integration focuses on end-to-end GPU execution, the Apache Spark integration with Apache Gluten and cuDF currently focuses on offloading specific query stages. This allows the most compute-intensive parts of a workload to be dispatched to GPUs, a strategy that makes the best use of GPU resources in hybrid clusters containing both CPU and GPU nodes.

For example, the second stage of TPC-DS Query 95 SF100 is compute intensive and can slow down CPU-only clusters. Offloading this stage to GPU achieves significant performance gains. CPU capacity remains on the cluster, available for other queries or workloads.

As shown in Figure 4, even when the first stage of TableScan is run with CPU execution, efficient interoperability between CPU and GPU enables a faster total runtime when the second stage offloads to GPU. The CPU-only condition uses eight vCPUs, while the First Stage CPU+GPU condition uses eight vCPUs plus one NVIDIA T4 GPU (a g4dn.2xlarge instance).
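The hybrid pattern can be sketched at the DataFrame level: run one stage on the CPU, hand the compute-heavy stage to the GPU, and return the result to the CPU side. This illustrative Python snippet uses pandas and cuDF; the real Spark/Gluten integration offloads whole query stages rather than individual DataFrame calls.

```python
import pandas as pd
import cudf

# Stage 1 on CPU: scan and filter with pandas.
cpu_df = pd.DataFrame({"key": [1, 1, 2, 2], "val": [1.0, 2.0, 3.0, 4.0]})
cpu_df = cpu_df[cpu_df["val"] > 1.0]

# Stage 2 offloaded to GPU: compute-heavy aggregation in cuDF.
gpu_df = cudf.from_pandas(cpu_df)
agg = gpu_df.groupby("key").agg({"val": "sum"})

# Hand the result back to the CPU side of the pipeline.
result = agg.to_pandas()
print(result)
```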

Get involved with GPU-powered, large-scale data analytics

Driving GPU acceleration in the shared Velox execution engine unlocks performance gains for a wide array of downstream systems across the data processing ecosystem. The team is working with contributors across many companies to implement reusable GPU operators in Velox, and in turn accelerate Presto, Spark (through Gluten), and other systems. This approach reduces duplication, simplifies maintenance, and brings new innovation across the open data stack.

We’re excited to share this open source work with the community and hear your feedback. We invite you to get involved.

Acknowledgments

Many developers contributed to this work. IBM contributors include Zoltán Arnold Nagy, Deepak Majeti, Daniel Bauer, Chengcheng Jin, Luis Garcés-Erice, Sean Rooney, and Ali LeClerc. NVIDIA contributors include Greg Kimball, Karthikeyan Natarajan, Devavret Makkar, Shruti Shivakumar, and Todd Mostak.
