Second Brain: Crafted, Curated, Connected, Compounded on 10月02日 21:00
数据建模:数据工程的基石
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文深入探讨了数据建模在数据工程中的核心作用,强调了其作为组织数据结构化表示的蓝图,对于设计数据仓库、数据湖等分析解决方案至关重要。文章回顾了数据建模的演变,从早期的Inmon与Kimball之争,到如今多层次、多方法的复杂体系。作者区分了不同级别的数据建模(如逻辑数据模型到物理数据模型),并强调了在设计过程中需要考虑的细节,包括维度、事实、分区、索引、关系、约束,以及冗余、反规范化、数据粒度等,最终目标是优化数据质量和查询效率,以满足商业价值的需求。

📊 **数据建模是数据工程的基石**:数据建模通过创建数据结构化表示,清晰地描绘了组织数据的关系、约束和模式,为设计数据仓库、数据湖和分析解决方案提供了关键的蓝图,确保数据能够高效、结构化地流动,并保持高质量和最小化冗余。

🏗️ **多层次与多方法的数据建模**:随着技术发展,数据建模已演变为包含多个层次和方法的复杂体系。从最初的Inmon与Kimball之争,到如今涉及数据源设计、ETL、数据仓库 schema、数据湖构建、BI展现层设计、机器学习特征工程等各个环节,体现了其广泛的应用范围。

🔧 **从逻辑到物理的设计细节**:数据建模设计过程涉及从逻辑数据模型到物理数据模型的转化,需要详细定义维度、事实表、分区、索引、关系和约束。同时,必须关注数据冗余、反规范化、数据粒度以及如何优化数据以支持高效的仪表板、即席查询和数据应用,确保整个过程中的数据质量。

✍️ **建模语言与工具的演变**:数据建模正日益演变为一种“语言”的艺术,不同的数据角色(如数据科学家、数据工程师)使用不同的术语。建模过程也从在数据平台内部进行,更多地转向在纸上、Excalidraw或专业建模工具中进行逻辑建模,再通过dbt等工具在物理层面实现,dbt以其SQL基础和内置文档功能成为重要选择。

Data modeling has changed; when I started (~20 years ago), choosing between Inmon and Kimball was common.

Today, in data engineering, data modeling creates a structured representation of your organization’s data. Often illustrated visually, this representation helps understand the relationships, constraints, and patterns within the data and serves as a blueprint for gaining business value in designing data systems, such as data warehouses, lakes, or any analytics solution.

In its most straightforward form, data modeling is how we design the flow of our data such that it flows as efficiently and in a structured way, with good data quality and as little redundancy as possible.

Abstract

Data modeling and design are key to creating usable data systems. Designing involves different high-level modeling, such as generation or source database design, data integration, ETL processes, data warehouse schema creation, data lake structuring, BI tool presentation layer design, machine learning, or AI feature engineering.

Further details design involves moving from a Logical Data Model to a Physical Data Model, defining tables with dimensions and facts, partitioning and indices, relationships, constraints, etc. Further, the architect needs to be aware of redundancy and denormalization, the grain of the table, and data optimized for efficient queries of dashboards, ad-hoc notebooks, or data apps, ensuring data quality throughout the process.

Like to Read as a Blog article?

These here are my evolving notes about modeling. If you like a more condences write-up, check out my three-part write-up about Data Modeling – The Unsung Hero of Data Engineering. Otherwise, continue here.

# Different Levels of Data Modeling

What do you think about different levels of modeling? Generally, when I started (20 years ago), choosing between Inmon and Kimball was common. But today, there are so many layers, levels, and approaches. Did you find a good way of separating or naming the different “levels” (still unsure about levels) to clarify what is meant? Below is a list of what I think so far (I also wrote extensively about it, in case of interest).

LinkedIn Post and Discussion, X/Twitter and dbt Slack. Links (from the post): Data Model Matrix.

# Different Data Modeling Techniques

See Data Modeling Techniques

# (Design) Patterns

Common approaches are well explained here:

others

# Data Modeling is changing

Data Modeling is as much about Data Engineering Architecture as it is about modeling the data only. Therefore besides the below links, many approaches and common architecture you can find in Data Engineering Architecture.

It’s getting more about language than really modeling, Shane Gibson says on Making Data Modeling Accessible. For example, a Data Scientist speaks Wide Tables, a Data engineer talks about facts and dimensions, etc., it’s what I call the different levels of data modeling.

See more on Data Modeling is changing.

See Data Modeling Tools.

Mostly, modeling happens outside the data platform (Logical Data Model). I’d argue you should model on Paper, Excalidraw, or sophisticated model tools (see more on Data Modeling: Architecture Pattern, Tools and the Future (Part 3)) before integrating into any tool.

The next layer (Physical Data Model) would be dbt as the tool I’d use to implement the modeled architecture. It’s SQL, everybody understands, and you get documentation out of the box. Integrated into Dagster, you get a high-level data-flow model from tables to data assets.

More on Logical vs. Physical see Conceptual, Logical to physical Data Models.

Tweet

# Data Modeling Languages

See Data Modeling Languages, or also on Data Model Engines.

# Data Modeling Frameworks

See Data Modeling Frameworks

See Data Modeling – The Unsung Hero of Data Engineering- Modeling Techniques and Data Architecture Patterns (Part 2).

There is more than dimensional modeling:

    hierarchies, semistructured sources, conformed dimensions, historical updates, and the logic used to keep them up to dateSource: Serge Gershkovich on LinkedIn

# Data Modeling part of Data Engineering?

Data modeling, incredibly Dimensional Modeling with defining facts and dimensions, is a big thing for a data engineer, IMO. It would help if you asked vital questions to optimize for data consumers. Do you want to drill down the different products? Daily or monthly enough —keywords Granularity and rollup.

It also lets you consider Big-O implications regarding how often you touch and transfer data. I’d recommend the old Data Warehouse Toolkit from Ralph Kimball, which initiated many of these concepts and is still applicable today. Mostly, it’s not done in the beginning, but as soon as you get bigger, you wish you had done more :)

Relates to .

Links:

# Further Readings

I wrote an extensive three-part series about data modeling; check it out on my blog:

    An Introduction to Data ModelingModeling Approaches and TechniquesArchitecture Pattern, Tools and the FutureBook: Data Modeling with Snowflake (Serge Gershkovich)

Origin:
References: enterprise architecture modeling
Created 2022-09-24

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

Data Modeling Data Engineering Data Warehouse Data Lake ETL Schema Design Logical Data Model Physical Data Model dbt 数据建模 数据工程 数据仓库 数据湖 数据架构
相关文章