Second Brain: Crafted, Curated, Connected, Compounded, October 2
Ballista: A Distributed SQL Query Engine Built on Apache Arrow

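Ballista is a distributed compute platform written in Rust and built on Apache Arrow, positioned as a strong competitor to Apache Spark. Its core strengths are efficient memory management and cross-language support: by using Apache Arrow as both its memory model and its network protocol, it significantly reduces serialization overhead. Its architecture consists of scheduler and executor processes. SQL and DataFrame queries can be submitted from Python and Rust, and a JDBC driver connects it to common data analysis tools. Compared with Spark, Ballista uses memory more efficiently, with memory usage that can be 5x to 10x lower in some cases, so more processing fits on a single node and the overhead of distributed compute drops. Ballista has since been split out of the DataFusion project and is developed independently, focusing on providing distributed query services to end users.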

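🚀 **High performance and memory efficiency**: Ballista uses Rust as its main execution language, which makes memory usage deterministic and avoids GC pauses. Built on Apache Arrow, it supports vectorized processing (SIMD and GPU) and efficient compression, so memory usage can be 5x to 10x lower than Apache Spark in some cases. This raises what a single node can process and reduces the overhead of distributed compute.

🌐 **Cross-language support with low overhead**: Ballista uses Apache Arrow as its memory model and network protocol, so data can be exchanged between programming languages with minimal serialization overhead. Developers can interact with Ballista from languages such as Python and Rust without paying a serialization penalty.

🔧 **Distributed architecture and deployment**: Ballista consists of standard scheduler and executor processes and can be deployed in containerized environments such as Docker and Kubernetes. Users can submit SQL and DataFrame queries from Python and Rust, and connect tools such as DataGrip or Tableau through the JDBC driver, which gives several flexible ways to run queries.

🧭 **Development and positioning**: Ballista was originally part of DataFusion and has since been split out to develop independently, focusing on being an end-user distributed system. It is meant to work alongside DataFusion, and switching between the two is expected to become more seamless, letting users choose an in-process query engine (DataFusion) or a distributed query system (Ballista) as needed.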

Ballista: Distributed SQL Query Engine, built on Apache Arrow. Competes with Apache Spark. It’s basically a scheduler for DataFusion on top of Apache Arrow.

Ballista is a distributed compute platform primarily implemented in Rust and powered by Apache Arrow. It is built on an architecture that allows other programming languages to be supported as first-class citizens without paying a penalty for serialization costs.

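The foundational technologies in Ballista are:

- Apache Arrow memory model and compute kernels, for efficient processing of data.
- Apache Arrow DataFusion query planning and execution framework, extended by Ballista to provide distributed planning and execution.
- Apache Arrow Flight protocol, for efficient data transfer between processes.
- Google Protocol Buffers, for serializing query plans.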

# How it works

Ballista has a scheduler process and an executor process. Both are standard Rust executables and can be run directly, but Dockerfiles are also provided to build images for containerized environments such as Docker, Docker Compose, and Kubernetes. See the deployment guide for more information.
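To make the scheduler/executor split concrete, here is a toy sketch in plain Python. It is not Ballista code and does not use Ballista's API: `executor_task`, `scheduler_run`, and `PARTITIONS` are hypothetical names, and threads stand in for the separate executor processes. It only illustrates the general pattern of a scheduler fanning one task per partition out to executors and merging their partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data: each inner list stands in for one partition of a table.
PARTITIONS = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]


def executor_task(partition):
    """Hypothetical executor work: a partial aggregate over one partition."""
    return sum(partition), len(partition)


def scheduler_run(partitions, num_executors=2):
    """Hypothetical scheduler: fan tasks out, then merge the partial results."""
    # Threads are an assumption made for brevity; real executors are processes.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partials = list(pool.map(executor_task, partitions))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total, count, total / count


total, count, mean = scheduler_run(PARTITIONS)
print(f"sum={total} count={count} avg={mean}")
```

The merge step is what makes the work distributable: each executor only needs its own partition, and the scheduler combines `(sum, count)` pairs rather than raw rows.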

SQL and DataFrame queries can be submitted from Python and Rust. SQL queries can also be submitted via the Apache Arrow Flight SQL JDBC driver, which supports your favorite JDBC-compliant tools such as DataGrip or Tableau. For setup instructions, please see the FlightSQL guide.

# How does Ballista (Arrow-Rust) compare to Apache Spark?

From the Overview page of the Arrow DataFusion documentation:

Although Ballista is largely inspired by Apache Spark, there are some key differences.

- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of GC pauses.
- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still largely row-based today.
- The combination of Rust and Apache Arrow provides excellent memory efficiency, and memory usage can be 5x - 10x lower than Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of distributed compute.
- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors in any programming language with minimal serialization overhead.
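The row-based versus columnar distinction in the points above can be illustrated with a rough analogy using only the Python standard library. This is not Arrow or Ballista code, and the ratio it prints is specific to CPython object overhead; it says nothing about the 5x - 10x figure quoted for Spark. The helper names `row_layout_bytes` and `columnar_layout_bytes` are made up for this sketch, and the sizes are approximations based on `sys.getsizeof`.

```python
import sys
from array import array

N = 100_000

# Row-based layout: one tuple per record, each field a separate boxed object.
rows = [(i, i * 0.5) for i in range(N)]

# Columnar layout: one contiguous, typed buffer per column.
ids = array("q", range(N))                        # signed 64-bit integers
values = array("d", (i * 0.5 for i in range(N)))  # 64-bit floats


def row_layout_bytes(records):
    """Approximate bytes for the list, its tuples, and the boxed fields."""
    total = sys.getsizeof(records)
    for rec in records:
        total += sys.getsizeof(rec)
        total += sum(sys.getsizeof(field) for field in rec)
    return total


def columnar_layout_bytes(*columns):
    """Bytes held by the typed column buffers."""
    return sum(sys.getsizeof(col) for col in columns)


row_bytes = row_layout_bytes(rows)
col_bytes = columnar_layout_bytes(ids, values)
print(f"row-based: {row_bytes} bytes, columnar: {col_bytes} bytes")

# Both layouts hold the same data; a column scan reads one contiguous buffer
# instead of visiting every record.
assert sum(values) == sum(rec[1] for rec in rows)
```

Keeping each column in a single typed buffer is also what makes vectorized processing and per-column compression practical, since all values in the buffer share one type and sit next to each other in memory.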

Another great read: "Delta Lake Without Spark (Delta-Rs). Innovation, Cost Savings, and Other Such Matters." from Confessions of a Data Guy.

# Ballista Roadmap

Contrary to what the docs say (see the Roadmap page), Ballista was once part of DataFusion but has since been split out again. Andy Grove, creator of DataFusion and Ballista, said (2023-02-22):

Ballista was part of the DataFusion repo and got moved out a while back so that each project could focus on the needs of the user base. DataFusion is a framework for in-process query engines that many products are now building on top of. Ballista is an end-user distributed system (although some companies are also using this as a foundation for new systems).

I am hoping that it will soon be pretty seamless to switch between DataFusion and Ballista when using these systems from Python by just changing the imports, but for now they are quite separate systems.

# Running Queries

For an example in Python, see sqlbench-runners/sqlbench-ballista.py (main branch of sql-benchmarks/sqlbench-runners on GitHub).


Origin: David Gasquez
References: apache/arrow-ballista
Created 2022-10-20
