Second Brain: Crafted, Curated, Connected, Compounded on October 2, 21:10
Modern OLAP Systems: Beyond Traditional Data Cubes
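This article explores how modern OLAP systems replace traditional OLAP data cubes, with a focus on faster query response times and on handling large-scale data. It introduces the concepts of Real-Time OLAP systems and the Real-Time Data Warehouse, and contrasts DuckDB with distributed OLAP systems such as Druid, ClickHouse, and Pinot. On the open-source side, it draws on Michael Driscoll's assessment and Ivan Torres's hands-on experience to highlight StarRocks's strengths in versatility, real-time data processing, ad hoc analytics, and integration with external data sources, and compares it with ClickHouse and Druid. Finally, it offers recommendations for choosing among ClickHouse, Pinot, Druid, StarRocks, and Doris across deployment scenarios: small, large, cloud-native, log analytics, unified analytics, and hybrid data warehouse.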

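💡 **Modern OLAP systems replace traditional data cubes**: Modern OLAP systems aim for faster query response times by replacing traditional OLAP data cubes with big-data-optimized technology while leaving the rest of the existing architecture unchanged. Such systems are also called real-time OLAP systems or real-time data warehouses.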

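🦆 **How DuckDB differs from distributed OLAP systems**: DuckDB is known for being lightweight, in-process, and highly portable; it can be seen as the SQLite of the OLAP world. Druid, ClickHouse, and Pinot, by contrast, are distributed systems designed for large-scale data workloads. Rill uses DuckDB for data volumes of 100GB or less, while larger workloads tend toward distributed systems.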

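⭐ **StarRocks's versatility and strengths**: StarRocks stands out among open-source OLAP systems, especially for its versatility. It supports real-time data processing, does well at ad hoc analytics, and integrates with external catalogs such as Iceberg to read external tables. It also supports separation of storage and compute and achieves good performance through disk caching. Compared with Druid (where ad hoc joins are difficult) and ClickHouse (where external table integration is inconvenient), StarRocks offers a more flexible solution, and its commitment to open source is also praised.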

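📊 **Scenario-based recommendations for choosing an OLAP system**: Different systems are recommended for different deployment needs. For small-to-medium deployments, ClickHouse is the general-purpose choice thanks to its ease of use. Large on-premise deployments can consider ClickHouse, Pinot, and Druid. Cloud-native implementations can choose ClickHouse Cloud, StarRocks, or Doris. Druid and Pinot are particularly well suited to log analytics and time-series data. ClickHouse, StarRocks, and Doris excel at unified analytics and query federation. StarRocks and Doris combine traditional data warehouse capabilities with real-time OLAP performance in hybrid data warehouse-OLAP solutions.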

With modern OLAP systems, you replace your Traditional OLAP Cubes one-to-one with another technology. You keep everything else in your current architecture the same, but swap the cubes for a modern, big-data-optimized technology that focuses on the fastest query response time. See the Appendix for a comparison between modern OLAP technologies.

Another term would be Real-Time OLAP systems, or Real-Time Data Warehouse.

Traditional open-source OLAP systems, or modern OLAP systems, are:


[Images omitted; their sources are linked on the original page.]


[Figure: GitHub Star History]

To a certain extent, a Semantic Layer and its Semantic Layer Tools are cubes as well, especially if they have an OLAP cache layer.

# DuckDB vs. Modern OLAP Systems

DuckDB distinguishes itself by being lightweight, in-process, and thus highly portable. It is SQLite, but for OLAP. Druid, ClickHouse, and Pinot are distributed systems designed for large-scale data workloads. Rill uses DuckDB for workloads that are 100GB or less. Michael Driscoll on LinkedIn: DuckDB is the world’s most downloaded analytics database, with 1+ million…
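To make "in-process" concrete, here is a minimal sketch of what it means in practice: the engine runs inside the application process, with no server to start and no network hop. It uses Python's built-in `sqlite3` module purely as a stand-in so that it runs without extra installs; it is not DuckDB, and the `page_views` table and its data are made up for illustration.

```python
import sqlite3

# In-process: the database engine runs inside this Python process.
# There is no server to start and no network connection to open.
# NOTE: sqlite3 is only a stand-in for an in-process engine here;
# the table name and rows are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE page_views (country TEXT, views INTEGER)")
con.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("CH", 120), ("DE", 300), ("CH", 80), ("US", 500)],
)

# A typical OLAP-style query: aggregate a measure by a dimension.
rows = con.execute(
    "SELECT country, SUM(views) AS total_views "
    "FROM page_views GROUP BY country ORDER BY total_views DESC"
).fetchall()
print(rows)
con.close()
```

A distributed system such as Druid, ClickHouse, or Pinot would instead be reached over the network as a separate service, which is what makes it heavier to operate but able to scale beyond a single machine.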

# Comparison

# StarRocks vs. ClickHouse/Druid

Ivan Torres says:

I use StarRocks open source on K8s and benchmarked it against Druid. I also used ClickHouse before. For real-time data they all come pretty close, some optimizations of each database work better for specific use cases, but you can achieve pretty similar numbers with the three of them.

What made me go to StarRocks is that it is a lot more versatile. It also does pretty good ad hoc analytics and integrates with external catalogs like Iceberg for reading external tables. You can also have full separation of storage and compute and achieve good performance with disk caching.

In Druid joining tables ad hoc is pretty much impossible. You need a lot of work upstream for it to perform. In ClickHouse the integration to external tables is useless and you need to build everything by hand. Last, the commitment of StarRocks to open source is the best of all.

Check out more on Traditional OLAP Cube Replacements and Traditional OLAP Cubes.

# Recommendations

In his fantastic write-up, Alireza Sadeghi suggests these choices:

Small-to-medium Deployments:
Overall, ClickHouse is an excellent real-time OLAP engine suitable for small-to-medium environments. Its straightforward deployment, management, and architecture make it the preferred choice for general use cases.

Large On-Premise Deployments:
For large-scale implementations, particularly on Hadoop or similar platforms, ClickHouse, Pinot, and Druid are leading candidates. The final selection should align with specific workload requirements and use cases.

Cloud-Native Implementations:
Cloud-native deployments utilising object storage as the main persistence layer can leverage managed solutions like ClickHouse Cloud, or platforms such as StarRocks and Doris. However, consider that StarRocks and Doris introduced their decoupled architecture recently, suggesting careful evaluation for production readiness.

Log Analytics & Time-series Data:
Druid and Pinot demonstrate particular strength in processing immutable time-series data, including web logs, machine logs, and clickstream events. Their support for hybrid tables makes them ideal for Lambda-style architectures.

Unified Analytics with Query Federation:
ClickHouse, StarRocks, and Doris excel in unified analytics scenarios, offering query federation capabilities that enable seamless data access across diverse sources such as data lakes, lakehouses, and DBMS systems.

Hybrid Data Warehouse-OLAP Solutions:
StarRocks and Doris provide a middle ground, combining traditional data warehouse capabilities with real-time OLAP performance. They offer comprehensive CRUD operations, complex join support (including star schema), and ACID guarantees to some extent.
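The recommendations above can be read as a lookup from scenario to candidate systems. The sketch below transcribes them into a small table and intersects the candidates when more than one scenario applies. The scenario keys and the `candidates` function are hypothetical names chosen for this illustration, and the result is only a shortlist, not a substitute for evaluating a concrete workload.

```python
# Scenario -> candidate systems, transcribed from the recommendations above.
# The dictionary keys and the function name are hypothetical.
RECOMMENDATIONS = {
    "small-to-medium": ["ClickHouse"],
    "large-on-premise": ["ClickHouse", "Pinot", "Druid"],
    "cloud-native": ["ClickHouse Cloud", "StarRocks", "Doris"],
    "log-analytics-time-series": ["Druid", "Pinot"],
    "unified-analytics-federation": ["ClickHouse", "StarRocks", "Doris"],
    "hybrid-dwh-olap": ["StarRocks", "Doris"],
}


def candidates(*scenarios):
    """Return the systems listed for every given scenario.

    The order follows the first scenario's list. An unknown scenario
    name raises KeyError.
    """
    if not scenarios:
        return []
    first, *rest = (RECOMMENDATIONS[s] for s in scenarios)
    return [system for system in first if all(system in other for other in rest)]


# Example: a cloud-native deployment that also needs warehouse-style features.
print(candidates("cloud-native", "hybrid-dwh-olap"))
```

Note that the managed offering "ClickHouse Cloud" is kept as a separate entry from "ClickHouse", as in the text, so the two do not match each other in an intersection.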

# Further Readings


References: Vector Database (and Engine)