Nvidia Developer · October 6, 21:16
GPU-Accelerated Databases and Query Engines Boost Data Processing Efficiency

As data processing demands grow, GPU-accelerated databases and query engines are showing significant performance advantages over CPU-based systems. The high memory bandwidth and multithreaded processing of GPUs are especially well suited to compute-intensive tasks such as joins, aggregations, and string processing. Through a collaboration between IBM and NVIDIA, NVIDIA cuDF has been integrated into the Velox execution engine, enabling GPU-native query execution for platforms such as Presto and Apache Spark. The project aims to deliver end-to-end GPU acceleration for Presto by optimizing operators such as TableScan, HashJoin, and HashAggregation, and explores multi-GPU execution as well as hybrid CPU-GPU execution for Spark, meeting the challenges of massive-scale data processing and giving data and business analysts real-time insights.

🚀 GPUs significantly boost data processing performance, particularly for compute-intensive workloads such as joins, aggregations, and string operations, where their high memory bandwidth and thread counts pay off, giving data and business analysts faster real-time insight.

🤝 The IBM-NVIDIA collaboration is the key driver: by integrating NVIDIA cuDF into the Velox execution engine, it enables native GPU query execution for widely used platforms such as Presto and Apache Spark, a major technical advance for the data processing ecosystem.

✨ The Velox execution engine acts as an intermediate layer, translating Presto and Spark query plans into executable GPU pipelines powered by cuDF. Optimizations to operators such as TableScan, HashJoin, and HashAggregation enable end-to-end GPU acceleration for Presto, including multi-GPU execution and a hybrid CPU-GPU execution mode for Spark, significantly reducing query response times.

📊 Benchmark results show that GPU-accelerated Presto delivers severalfold performance gains over the CPU version on large datasets. For example, in the TPC-H benchmark, Presto on NVIDIA GPUs ran far faster than the CPU version, with even larger gains on multi-GPU nodes connected by high-bandwidth NVLink.

As workloads scale and demand for faster data processing grows, GPU-accelerated databases and query engines have been shown to deliver significant price-performance gains compared to CPU-based systems. The high memory bandwidth and thread count of GPUs especially benefit compute-heavy workloads like multiple joins, complex aggregations, string processing, and more. The growing availability of GPU nodes and the broad feature coverage of GPU algorithms make GPU data processing more accessible than ever before.

By addressing performance bottlenecks, both data and business analysts can now query massive datasets to generate real-time insights and explore analytics scenarios.
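To give a concrete feel for these operations, here is a minimal sketch using cuDF's Python API on made-up data (the integration described in this post is the C++ cuDF backend inside Velox, not this API): a GPU hash join, a GPU hash aggregation, and a GPU string operation.

```python
import cudf

# Illustrative tables; all columns live in GPU memory.
orders = cudf.DataFrame({
    "order_id": [1, 2, 3, 4],
    "cust_id": [10, 10, 20, 30],
    "amount": [5.0, 7.5, 3.2, 9.9],
})
customers = cudf.DataFrame({
    "cust_id": [10, 20, 30],
    "name": ["alice", "bob", "carol"],
})

# Hash join executed on the GPU.
joined = orders.merge(customers, on="cust_id", how="inner")

# Hash aggregation executed on the GPU.
totals = joined.groupby("name").agg({"amount": "sum"})

# String processing executed on the GPU.
joined["name_upper"] = joined["name"].str.upper()

print(totals)
```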

To support the increasing demand, IBM and NVIDIA are working together to bring NVIDIA cuDF to the Velox execution engine, enabling GPU-native query execution for widely used platforms like Presto and Apache Spark. This is an open project. 

How Velox and cuDF work together to translate query plans

Velox acts as an intermediate layer, translating query plans from systems like Presto and Spark into executable GPU pipelines powered by cuDF, as shown in Figure 1. For more details, see Extending Velox – GPU Acceleration with cuDF.

In this post, we’re excited to share initial performance results of Presto and Spark using the GPU backend in Velox. We dive into:

- End-to-end Presto acceleration
- Scaling up Presto to support multi-GPU execution
- Demonstrating hybrid CPU-GPU execution in Apache Spark

Figure 1. A query flows from Presto or Apache Spark through the Velox engine, where it is converted into executable GPU pipelines powered by cuDF

Moving the entire Presto query plan to GPU for faster execution

The first step of query processing is to translate incoming SQL commands into query plans with tasks for each node in the cluster. On each worker node, the cuDF backend for Velox receives a plan from the Presto coordinator, rewrites the plan using GPU operators, and then executes the plan. 
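The plan-rewriting step can be pictured roughly as a tree walk that swaps CPU operators for GPU equivalents. The sketch below is purely illustrative Python with hypothetical names; the real cuDF backend rewrites Velox plan nodes in C++.

```python
# Hypothetical operator mapping, for illustration only; these are not
# Velox or cuDF class names.
CPU_TO_GPU = {
    "TableScan": "CudfTableScan",
    "HashJoin": "CudfHashJoin",
    "HashAggregation": "CudfHashAggregation",
    "FilterProject": "CudfFilterProject",
}

def rewrite_plan(node):
    """Recursively replace CPU operators with GPU equivalents where available."""
    node["operator"] = CPU_TO_GPU.get(node["operator"], node["operator"])
    for child in node.get("children", []):
        rewrite_plan(child)
    return node

# A toy plan shaped like the ones discussed in this post.
plan = {
    "operator": "HashAggregation",
    "children": [{
        "operator": "HashJoin",
        "children": [
            {"operator": "TableScan", "children": []},
            {"operator": "TableScan", "children": []},
        ],
    }],
}
rewrite_plan(plan)  # every operator in the tree now names a GPU implementation
```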

Running Presto plans using Velox with cuDF required improvements to the GPU operators for TableScan, HashJoin, HashAggregation, FilterProject, and more. 

- TableScan: The Velox TableScan was extended on CPU to be compatible with GPU I/O, decompression, and decoding components in cuDF.
- HashJoin: The available join types were expanded to include left, right, and inner, as well as support for filters and null semantics.
- HashAggregation: A streaming interface was introduced to manage partial and final aggregations (sketched below).
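As a rough illustration of the partial/final aggregation pattern behind the streaming interface, the two phases can be expressed with cuDF's Python API on toy chunks (the actual Velox cuDF backend implements this in C++ and differs in detail):

```python
import cudf

# Two incoming chunks of a larger stream.
chunks = [
    cudf.DataFrame({"key": ["a", "b", "a"], "val": [1, 2, 3]}),
    cudf.DataFrame({"key": ["b", "c"], "val": [4, 5]}),
]

# Partial aggregation: reduce each incoming chunk independently.
partials = [c.groupby("key").agg({"val": "sum"}) for c in chunks]

# Final aggregation: combine the partial results into one output.
final = cudf.concat(partials).reset_index().groupby("key").agg({"val": "sum"})
print(final)  # a=4, b=6, c=5
```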

Overall, the operator expansion in the cuDF backend for Velox enables end-to-end GPU execution in Presto, making full use of the Presto SQL parser, optimizer, and coordinator.

The team collected query runtime data from the Presto tpch benchmark (derived from TPC-H), using Parquet data sources with both the Presto C++ and Presto-on-GPU worker types. Note that Presto C++ was not able to complete Q21 with standard configuration options, so the figure reports the total runtime for the 21 successful queries.

As shown in Figure 2, at scale factor 1,000, we observed runtimes of 1,246 seconds for Presto C++ on an AMD 5965X, 133.8 seconds for Presto on an NVIDIA RTX PRO 6000 Blackwell Workstation, and 99.9 seconds for Presto on an NVIDIA GH200 Grace Hopper Superchip. We also used CUDA managed memory to complete Q21 on GH200 (see the Figure 2 asterisk), yielding a 148.9-second runtime for Presto GPU on the full query set.

Figure 2. Runtime results for 21 of 22 queries defined in Presto tpch, executed with single-node Presto C++ on CPU and Presto on NVIDIA GPUs at scale factor 1,000
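How Presto-on-GPU enables managed memory is specific to that deployment, but the concept can be sketched with RMM's Python API: route GPU allocations through CUDA managed (unified) memory, so working sets larger than device memory can migrate between host and device on demand.

```python
import rmm
import cudf

# Use CUDA managed (unified) memory for all subsequent GPU allocations,
# allowing data to spill between device and host memory transparently.
rmm.reinitialize(managed_memory=True)

df = cudf.DataFrame({"x": range(10)})  # allocated in managed memory
print(df["x"].sum())
```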

Multi-GPU Presto for faster data exchange and lower query runtime

In distributed query execution, Exchange is a critical operator that controls data movement between workers on the same node as well as between nodes. GPU-accelerated Presto uses a UCX-based Exchange operator that supports running the entire execution pipeline on GPU. UCX (Unified Communication X) is an open source communication library designed to achieve the highest performance for HPC applications; its core leverages high-bandwidth NVLink for intra-node connectivity and RoCE or InfiniBand for inter-node connectivity.

Velox supports several Exchange types for different types of data movements: Partitioned, Merge, and Broadcast. Partitioned Exchange uses a hash function to partition input data and then sends the partitions to other workers for further processing. Merge Exchange receives multiple input partitions from other workers and then produces a single, sorted output partition. Broadcast Exchange loads the data in one worker and then copies the data to all other workers. Integration of GPU exchange into the cuDF backend for Velox is in progress, and the implementation is available on mainline Velox.
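The Partitioned Exchange idea can be illustrated with cuDF's Python API: hash-partition a DataFrame so each partition can be shipped to a different worker. This sketch shows only the partitioning step; the actual Velox exchange moves these buffers over UCX, using NVLink, RoCE, or InfiniBand.

```python
import cudf

df = cudf.DataFrame({
    "key": [1, 2, 3, 4, 5, 6],
    "val": [10, 20, 30, 40, 50, 60],
})

# Split rows into 3 partitions by hashing the "key" column; in a distributed
# exchange, partition i would be sent to worker i for further processing.
partitions = df.partition_by_hash(["key"], nparts=3)
for i, part in enumerate(partitions):
    print(f"worker {i} receives {len(part)} rows")
```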

As shown in Figure 3, Presto achieves efficient performance on GPU with the new UCX-based Exchange, especially when high-bandwidth intra-node connectivity is provisioned between GPUs. Results were collected for Presto on GPU with both the baseline HTTP Exchange method and the UCX-based cuDF Exchange method. An eight-GPU NVIDIA DGX A100 node delivered a >6x speedup when using NVLink in the Exchange operator compared to the Presto baseline HTTP Exchange. With eight GPU workers, Presto can finish all 22 queries with the default async memory allocation, without using managed memory.

Figure 3. Runtime results for the 22 queries defined in Presto tpch benchmark, executed with Presto GPU on NVIDIA DGX A100 (eight A100 GPUs) at scale factor 1,000 

Hybrid CPU-GPU execution in Apache Spark

While the Presto integration focuses on end-to-end GPU execution, the Apache Spark integration with Apache Gluten and cuDF currently focuses on offloading specific query stages. This allows the most compute-intensive parts of a workload to be dispatched to GPUs, a strategy that makes the best use of GPU resources in hybrid clusters containing both CPU and GPU nodes.

For example, the second stage of TPC-DS Query 95 SF100 is compute intensive and can slow down CPU-only clusters. Offloading this stage to GPU achieves significant performance gains. CPU capacity remains on the cluster, available for other queries or workloads.

As shown in Figure 4, even when the first stage of TableScan is run with CPU execution, efficient interoperability between CPU and GPU enables a faster total runtime when the second stage offloads to GPU. The CPU-only condition uses eight vCPUs, while the First Stage CPU+GPU condition uses eight vCPUs plus one NVIDIA T4 GPU (a g4dn.2xlarge instance).
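The hybrid pattern can be sketched at the DataFrame level: run one stage on the CPU, hand the compute-heavy stage to the GPU, and return the result to the CPU side. This illustrative Python snippet uses pandas and cuDF; the real Spark/Gluten integration offloads whole query stages rather than individual DataFrame calls.

```python
import pandas as pd
import cudf

# Stage 1 on CPU: scan and filter with pandas.
cpu_df = pd.DataFrame({"key": [1, 1, 2, 2], "val": [1.0, 2.0, 3.0, 4.0]})
cpu_df = cpu_df[cpu_df["val"] > 1.0]

# Stage 2 offloaded to GPU: compute-heavy aggregation in cuDF.
gpu_df = cudf.from_pandas(cpu_df)
agg = gpu_df.groupby("key").agg({"val": "sum"})

# Hand the result back to the CPU side of the pipeline.
result = agg.to_pandas()
print(result)
```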

Get involved with GPU-powered, large-scale data analytics

Driving GPU acceleration in the shared Velox execution engine unlocks performance gains for a wide array of downstream systems across the data processing ecosystem. The team is working with contributors across many companies to implement reusable GPU operators in Velox, and in turn accelerate Presto, Spark (through Gluten), and other systems. This approach reduces duplication, simplifies maintenance, and brings new innovation across the open data stack.

We’re excited to share this open source work with the community and hear your feedback. We invite you to get involved.

Acknowledgments

Many developers contributed to this work. IBM contributors include Zoltán Arnold Nagy, Deepak Majeti, Daniel Bauer, Chengcheng Jin, Luis Garcés-Erice, Sean Rooney, and Ali LeClerc. NVIDIA contributors include Greg Kimball, Karthikeyan Natarajan, Devavret Makkar, Shruti Shivakumar, and Todd Mostak.
