Second Brain: Crafted, Curated, Connected, Compounded on 10月02日
One Big Table数据模型解析
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

One Big Table (OBT) 是一种宽表数据模型,将事实数据和大部分维度属性整合在同一张表中,减少连接需求,提升查询效率。这种模型利用现代云数据库的强大计算能力,简化数据流程,但需注意管理维度变化、计算成本和数据冗余问题。OBT适用于需要扁平数据集的场景,如CSV导出或关系支持不足的BI工具,而支持关系的工具则更适合维度模型。

😔 One Big Table (OBT) 是一种将事实表与大部分维度属性整合为一张宽表的数据模型,通过减少连接操作简化查询过程,显著提升查询响应速度,尤其借助列编码技术无需连接即可实现。

📈 OBT的核心优势在于其查询效率的提升,得益于存储成本的降低和现代云数据库的强大计算能力,使得宽表模型得以高效运行,数据可直接从 staging 层流转至最终报告层,无需中间层。

🔄 管理OBT中的维度变化是一大挑战,需在最新维度值与时间点值之间做出权衡,同时新增维度属性通常需要回填历史数据(除非使用Apache Iceberg等支持变更数据捕获的技术)。

⚠️ OBT模型可能导致数据冗余并增加对源系统的依赖,限制数据建模的灵活性,因此在设计时需权衡计算成本,避免因复杂查询和业务逻辑导致资源消耗过高,并制定合理的分层建模策略。

And extension of Kimball, suitable for long-term analysis. Essentially a denormalized table, akin to what is found in a Data Mart. And also synonym with a Wide Table.


Data Modeling in the Modern Data Stack - YouTube

On Normalization vs Denormalization, One Big Table is very much on the right with most denormalization.

Synonyms

The OBT technique is nothing new; at the start of my career, we built data marts with materialized views and used them in similar ways to OBT today.

They are very similar to initially discussed MVs and to data martsWide TablesSuper TableSnapshotting, or just any denormalized table.

# History

The term “One Big Table” (OBT) doesn’t seem to have a single point of origin. This means it either came up from the natural language to make one big table, therefore also the close connection to existing terms, or someone tried to create a new marketing term. I initially found the first appearance in 2021 and later in 2022.

But it might have come from BigQuery with its insufficient joining capabilities. Google suggests people put everything in one big table[^note1]:

“February 2028, Josh Andrews, dbt Slack

I haven’t looked at Bigquery in a while. Still, when I spoke to the Google team a couple of years ago about building a “true” data warehouse with facts and dimensions and asked for their recommendations about data modeling, they said essentially, “Just denormalize everything and put it in one big table”. Also see what Bora Beran has to say, Bigtable Product Lead at Google. 

From there, it might have been a common term used in the community for the denormalized or wide table or even “Super Table”. On the other hand, Super Table has been announced and will probably be introduced here in 2022. If you have more information about the origin, I’d be happy to hear from you.

# Pros and Cons

One Big Table refers to a facts table that includes most of its dimensional attributes as additional columns.

# Pros:

    Simplifies querying by reducing the need for multiple joins.Enhances query response time, notably with column encoding and without joins.

# Cons:

    Challenges arise in managing changing dimensions within OBT.Decisions between using the latest dimension values or vs. point-in-time values.Adding new dimensional attributes requires backfilling (not if using Apache Iceberg, etc)
    Source: Joseph M. on LinkedIn

Reply by Jonathan Neo:

Regarding one of the cons: 

Need help managing changing dimensions in OBT. 

If you’re creating an OBT on top of a Dimensional Model and you have created a fact and dimension with SCD Type 2 handling, then the OBT will inherit SCD Type 2 properties. 

This is why I prefer to build OBT models on top of dimensional models as it facilitates D.R.Y patterns, and reuse of dimension tables. 

Personally, I reserve OBT for tools that require a flat dataset (e.g. a CSV export, or a BI tool that does not support relationships like Preset). If the BI tool supports relationships (e.g. Power BI, Tableau), then I would directly consume the dimensional model.

The One Big Table approach, or OBT, represents a shift towards wide, denormalized models. The core ideas here are that storage is inexpensive, allowing for broad tables, and modern cloud databases have such robust computational capabilities that they can efficiently handle these extensive models. This approach streamlines the process, moving data directly from staging to the final reporting mart without intermediate layers.

However, this efficiency can come at a cost. Complex queries and extensive business logic can drive up computational expenses, particularly in large-scale enterprises. Additionally, this model can lead to data redundancy and an over-reliance on source systems, limiting flexibility in Data Modeling. It’s crucial to strategize the modeling of each layer to avoid data overload and maintain control over your pipelines.

In summary, while OBT offers quick setup and operational simplicity, it requires careful consideration of computational costs and strategic data modeling to avoid scalability issues and data management complexities.

# References

Explore more in Data Modeling and my online book about Materialized Views vs. One Big Table (OBT) vs. dbt tables vs. Traditional OLAP vs. DWA - 📖 Data Engineering Design Patterns (DEDP).


Origin: Data Modeling in the Modern Data Stack - YouTube, Star Schema vs. OBT for Data Warehouse Performance | Blog | Fivetran
References:
Created 2022-10-14

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

One Big Table Data Modeling Denormalization Wide Table Data Engineering Cloud Databases
相关文章