OLAP技术演进与应用

# What is OLAP?

OLAP stands for Online Analytical Processing. It’s a powerful tool for multidimensional business data analysis, offering complex calculations, trend analysis, and advanced data modeling capabilities.

An OLAP Cube is a specialized multidimensional database, optimized for data warehouse and OLAP applications. It stores data in a multi-dimensional form, primarily for reporting. In these cubes, data (measures) are organized by dimensions.

Microsoft, in the late 1990s, introduced a query language for OLAP cubes called Multidimensional Expressions (MDX), which many other multidimensional database vendors have since adopted. However, managing an OLAP cube with MDX requires specific expertise.

# Modern vs. Traditional OLAP Cubes

The comparison between Modern OLAP Systems and Traditional OLAP Cubes is intriguing.

SSAS embodies traditional OLAP. It remains widely used and is a prominent figure in the OLAP cube world. Developed in the late 1990s, SSAS introduced the MDX query language for data querying and analysis.

Traditional cubes had the advantage of pre-processing, with most queries cached in advance. The disadvantage was the inflexibility of changing measures quickly. Alterations required modifying the solution, redeploying it in production, and reprocessing the entire cube.

Modern OLAP systems, in contrast, allow defining queries and measures at the time of querying without pre-processing the entire cube. In this chapter, I focus on modern OLAP systems like Apache Druid, Apache Pinot, and ClickHouse. These systems offer exceptional performance for customer-facing or sub-second experiences, which Cloud Data Warehouses cannot match.

Modern OLAP systems closely resemble a Semantic Layer with caching functionality. A Semantic Layer is essentially an OLAP cube but with additional features like access permissions, an API layer, and included data modeling. More insights on this can be found in The Rise of the Semantic Layer.

# Postgres become an OLAP system too?

Latest, PostgreSQL is becoming an OLAP system. MotherDuck recently published pg_duckdb with backing from Microsoft, Neon, and others. pg_duckdb extension integrates DuckDB with PostgresSQL. As well as ParadeDB has seen some trackion with its pg_lakehouse, pg_analytics, and pg_search PostgreSQL extensions.

But there is also Postgres has capabilities to run as Postgres OLAP.

There is no OLAP without Cubes, therefore these two terms are highly interlinked.

But, OLAP is a broad concept, encompassing the entire process of online analytical processing. This includes not just the storage of data, but also the methods and technologies used for enabling and performing complex analytical operations. It’s a category of software that allows users to analyze information from multiple database systems at the same time. It’s about the capability to process large volumes of data quickly to extract actionable insights, often in a business context. OLAP is the ‘how’ of multidimensional data analysis.

An OLAP cube, on the other hand, is a specific data structure or a storage paradigm within the OLAP framework. It’s a method to store data in a multidimensional form, which makes it easier to run complex queries and generate reports. Each dimension of the cube represents a different attribute or field of data, and the cube itself is optimized specifically for quick retrieval and analysis of data. Managing an OLAP cube efficiently, especially with query languages like MDX, requires specialized knowledge.

# History of OLAP Systems

This was enhanced by AI, based on my notes about OLAP/and links

OLAP (Online Analytical Processing) systems have a rich history dating back to the early 1990s. The term “OLAP” was coined by Edgar F. Codd in 1993, the same computer scientist who developed the relational database model. The earliest OLAP systems were designed to overcome the limitations of relational databases for analytical workloads by organizing data into multidimensional structures called “cubes.”

# The Birth of OLAP Cubes (1990s)

The first generation of OLAP tools emerged to address a fundamental business problem: while Relational Databases excelled at transaction processing (OLTP), they struggled with complex analytical queries. Companies like Arbor Software (later acquired by Hyperion, then Oracle) released Essbase in 1992, one of the first dedicated OLAP servers that used multidimensional storage.

These early OLAP systems followed the MOLAP (Multidimensional OLAP) approach, pre-calculating and storing aggregations across various dimensions in specialized data structures. This enabled quick “slice and dice” analysis but at the cost of long processing times and rigid data models.

# The Rise of Microsoft’s SSAS (Early 2000s)

Microsoft SQL Server Analysis Services (SSAS) became a dominant force in the OLAP landscape in the early 2000s. First introduced in SQL Server 7.0 (1998) and significantly enhanced in SQL Server 2000, SSAS popularized OLAP technology by integrating it directly with the Microsoft data platform.

SSAS revolutionized business intelligence by making multidimensional analysis accessible to a wider audience. Its integration with Excel, the world’s most widely used analytical tool, was particularly transformative. Business users could now perform complex slice-and-dice operations and drill-down analysis directly in a familiar spreadsheet environment, without needing to reconstruct business calculations for each report.

However, SSAS cubes suffered from the same limitations as other MOLAP systems of that era: inflexible schemas, long processing times, and difficulty handling very large datasets. These limitations would eventually lead to the need for more dynamic and scalable approaches.

# Column-Oriented Storage (Mid-2000s)

A pivotal shift occurred in the mid-2000s with the emergence of column-oriented storage designs. Unlike traditional row-based storage, column-oriented databases stored each column separately, enabling dramatic performance improvements for analytical workloads that typically access a subset of columns across many rows.

C-Store, an academic project started in 2005 (Data Engineering Whitepapers), and later commercialized as Vertica, demonstrated the power of this approach. By organizing data by columns rather than rows, these systems could:

Read only the specific columns needed for a query, drastically reducing I/OAchieve higher compression ratios since data in a single column tends to be similarEnable more efficient vectorized processing on modern CPUs

This innovation catalyzed a wave of new analytical database designs that would eventually transform the OLAP landscape.

# Vectorization and Modern CPU Architecture (2010s)

The next major evolution came from leveraging modern CPU capabilities. Traditional database engines processed data row-by-row, but modern CPUs with wide SIMD (Single Instruction, Multiple Data) registers could process multiple data points simultaneously.

Vectorized Query Execution, pioneered by systems like MonetDB and later adopted by ClickHouse (developed at Yandex in 2010), processed data in column “chunks” rather than individual values. This approach took full advantage of CPU cache hierarchies, pipelining, and SIMD instructions, delivering order-of-magnitude performance improvements.

ClickHouse, which became open source in 2016, exemplified how these techniques could be combined to create extraordinarily fast analytical databases. Its design enabled processing billions of rows per second on commodity hardware, making real-time analytics feasible at scale.

# From Pre-calculated Cubes to Dynamic Aggregation (2010-2015)

A fundamental shift occurred as systems moved away from pre-calculated cubes toward more dynamic aggregation models. This trend was driven by:

The need for fresher data without long cube processing timesRequirements for more flexible, ad-hoc analysisThe exponential growth in data volumes making full pre-aggregation impractical

New systems like Apache Druid (developed at Metamarkets in 2011) and Apache Pinot (created at LinkedIn in 2013) embraced this approach. They maintained the performance benefits of OLAP through intelligent segment design, indexing, and in-memory processing but offered greater flexibility and real-time capabilities.

These systems were initially designed for specific use cases—Druid for clickstream and event data analysis, Pinot for LinkedIn’s real-time analytics—but demonstrated a new hybrid model that combined OLAP performance with greater adaptability.

# Distributed Architecture and Commodity Hardware (2010-2015)

Another key innovation was leveraging distributed architectures on commodity hardware. Apache Druid, for instance, implemented a distributed shared-nothing architecture that could scale horizontally across inexpensive machines.

This approach enabled:

Parallel processing across many nodesLinear scalability by adding more machinesFault tolerance through data replicationCost-effective scaling using commodity hardware

The combination of column storage, vectorized processing, and distributed architectures paved the way for OLAP systems that could handle petabyte-scale data with sub-second query performance.

# The New Generation: Hybrid OLAP-Warehouse Systems (2020s)

The most recent evolution in the OLAP landscape is the emergence of systems that bridge the gap between traditional OLAP and data warehousing capabilities. Apache Doris (open-sourced by Baidu in 2017) and StarRocks (a 2020 fork of Doris) represent this new generation.

As shown in the GitHub star history, these newer systems gained significant momentum in 2023-2024, despite being relatively recent entrants. Their popularity stems from combining:

OLAP-like performance for analytical queriesData warehouse features like complex joins and ACID transactionsSupport for both mutable and immutable dataIntegration with modern data lake formats and federated queries

These systems aim to deliver “real-time data warehouse” capabilities—combining the speed of OLAP with the flexibility and broader functionality of data warehouses. This convergence addresses the limitations that restricted traditional OLAP systems to specialized use cases.

The continuous development of new OLAP systems reflects ongoing market demands for solutions that balance performance, flexibility, and cost-effectiveness in an increasingly data-driven world. As data volumes grow and real-time analytics become essential for business competitiveness, the evolution of these systems remains a dynamic and active area of innovation.

# What is OLAP?

# Modern vs. Traditional OLAP Cubes

# Postgres become an OLAP system too?

# History of OLAP Systems

# The Birth of OLAP Cubes (1990s)

# The Rise of Microsoft’s SSAS (Early 2000s)

# Column-Oriented Storage (Mid-2000s)

# Vectorization and Modern CPU Architecture (2010s)

# From Pre-calculated Cubes to Dynamic Aggregation (2010-2015)

# Distributed Architecture and Commodity Hardware (2010-2015)

# The New Generation: Hybrid OLAP-Warehouse Systems (2020s)

# Why replace traditional OLAP-Cubes

# What are Traditional OLAP Cube Replacements

# OLAP vs. OLTP

# Further Readings

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签