Second Brain: Crafted, Curated, Connected, Compounded
# Classical Architecture of Data Warehouse

 

This article walks through the classical data warehouse architecture, from data staging through cleansing to the core layer and the data marts, and compares architecture models such as Kimball, Medallion, and Azure. It emphasizes consistency as the guiding principle and uses a modern data architecture example to illustrate data flow, modeling tools (such as dbt), and the importance of metadata, helping readers understand the core value of layered data warehouse design.

💡 The classical data warehouse architecture uses a layered Staging → Cleansing → Core → Marts design: the Staging layer is the temporary landing point for raw data, the Cleansing layer handles data cleaning and transformation, the Core layer integrates subject-oriented data for the data marts, and the Marts provide query-optimized subsets for specific user groups.

🔄 Although the different architecture models (Kimball, Medallion, Azure) differ in naming and detail, the core idea is the same: layering enables task specialization and simplifies data cleansing, integration, historization, and query handling. The Medallion architecture adds a semantic layer or OLAP cubes, while Azure uses the naming raw → conformed → enriched → curated.

🛠️ A modern data architecture combines modeling tools such as dbt with techniques like SCD2 and dimensional modeling, integrating data through a staging → intermediate → marts flow. A concrete example shows the data flow (Postgres source data ingested via Arrow into the staging layer, then merged via SCD2 into the Core) and highlights the role of orchestration (a Load DAG and a Modeling DAG) in scheduling loading and modeling.

📊 Metadata is the foundation of the data warehouse: business metadata (attribute descriptions, drill paths, aggregation rules), technical metadata (data structures, mapping rules, ETL control parameters), and operational metadata (log tables, error messages, ETL process logs) together form the warehouse's operating framework and ensure data quality and traceability.

🔍 A grocery store analogy (loading dock, preparation area, market stalls) vividly explains each layer: Staging is the dock where suppliers unload, Cleansing is the quality check before goods go on the shelf, Core is the subject-oriented warehouse, and the Marts are specialized market stalls. The analogy makes the practical role of each layer easy to grasp.

This is the classical data warehouse architecture I learned at the beginning of my career. And to this day, I like to model my data warehouse this way.

Going from Staging -> Cleansing -> Core -> Marts:

From *Data Warehouse Blueprints: Business Intelligence in der Praxis* by Dani Schnider, Claus Jordan, Peter Welker, and Joachim Wehner (September 2016), created by Trivadis.

# Other Views - All the same?

I'd almost argue it does not matter which model or playbook you follow, as long as you apply it consistently and everyone follows the same one. Having dbt's conventions, the Kimball DW/BI Lifecycle Methodology, or another model as the documented goal can make everything much simpler.

Below you'll find a few more to compare.

# Medallion Architecture

Other views exist, e.g. the Medallion Architecture (image source: Behind the Hype - The Medallion Architecture Doesn’t Work - YouTube).

A modern take adds a Semantic Layer or OLAP Cubes, almost like the old days in the classical architecture above.

# Azure

Same architecture, different names. In the Azure world, the recommendation is raw → conformed (cleaned) → enriched → curated, based on Data lake zones and containers - Cloud Adoption Framework | Microsoft Learn. Thanks to Dominik Hron.

# dbt

dbt uses staging → intermediate → marts, see their definitions in How we structure our dbt projects.

What is your go-to modeling architecture? I like the one below; if the project is complex, I'd add a cleansing layer between staging and core.

# Modern Data Architecture and Orchestration

The example combines orchestration, modeling, and data ingestion tools with Data Modeling Techniques such as SCD2 and Dimensional Modeling (Kimball):

```mermaid
flowchart TB
    subgraph DWH_Architecture["Modern Data Stack Architecture"]
        direction LR
        subgraph DataFlow["Data Flow"]
            direction LR
            subgraph Source["Data Source"]
                DB[(Source Postgres)]
                TableA[Table_A]
                TableAHistory[Table_A_History]
                WALLogs[WAL-Logs]
                DB --- TableA
                DB --- TableAHistory
                DB --- WALLogs
            end
            subgraph Integration["Data Integration (dlt)"]
                Arrow["Arrow (in-memory)"]
                Norm["normalization schema"]
                Load["load"]
                LoadTables["Adding Load-Tables"]
                Arrow --> Norm
                Norm --> Load
                Load --> LoadTables
            end
            subgraph Warehouse["Data Warehouse"]
                direction LR
                subgraph LZ["staging (stg)"]
                    Address["Address\n...\nload_date"]
                    AddressHist["Address_History\n...\nload_date"]
                    Car["Car\n...\nload_date"]
                    CarHist["Car_History\n...\nload_date"]
                end
                subgraph Core["core (historization)"]
                    AddressCore["Address\nvalid_from\nvalid_to"]
                    CarCore["Car\nvalid_from\nvalid_to"]
                    TheoryExam["Theory_Exam\ncreated_at"]
                    LicensePlate["License_Plate\nvalid_from\nvalid_to"]
                    MergeSCD2["Merge SCD2"]
                end
                subgraph Mart["data mart (Star Schema)"]
                    direction LR
                    subgraph Facts["Fact Tables"]
                        CarFact["fact_car\n- metrics\n- foreign keys"]
                        TheoryFact["fact_theory_exam\n- metrics\n- foreign keys"]
                    end
                    subgraph Dims["Dimension Tables"]
                        DimAddress["dim_address\n- attributes\n- surrogate key"]
                        DimLicense["dim_license_plate\n- attributes\n- surrogate key"]
                        DimDate["dim_date\n- date attributes\n- surrogate key"]
                    end
                end
            end
            subgraph Tools["Tools & Orchestration"]
                direction TB
                subgraph Orchestration["Orchestration"]
                    LoadDag["Load DAG"]
                    ModelingDag["Modeling DAG"]
                    LoadDag --> ModelingDag
                end
                subgraph Modeling["dbt: Data Modeling"]
                    LZModels["Landing Zone Models"]
                    CoreModels["Core Models"]
                    MartModels["Mart Models"]
                    LZModels --> CoreModels
                    CoreModels --> MartModels
                end
            end
        end
    end

    %% Main data flow connections
    Source --> Integration
    Integration --> LZ
    Address --> AddressHist
    Car --> CarHist
    LZ --> MergeSCD2
    MergeSCD2 --> AddressCore
    MergeSCD2 --> CarCore
    MergeSCD2 --> TheoryExam
    MergeSCD2 --> LicensePlate
    AddressCore --> CarFact
    AddressCore --> TheoryFact
    CarCore --> CarFact
    LicensePlate --> CarFact
    TheoryExam --> TheoryFact

    %% Dimension relationships
    CarFact --- DimAddress
    CarFact --- DimLicense
    CarFact --- DimDate
    TheoryFact --- DimAddress
    TheoryFact --- DimDate

    %% Data Modeling Flow
    LZ -.->|"dbt run"| LZModels
    Core -.->|"dbt run"| CoreModels
    Mart -.->|"dbt run"| MartModels

    %% Orchestration
    LoadDag -.->|"triggers"| Integration
    ModelingDag -.->|"triggers"| Modeling
```
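The Load DAG → Modeling DAG dependency in the diagram can be sketched in a few lines of plain Python; this is only an illustrative stand-in for a real orchestrator (Airflow, Dagster, etc.), and all step names here are assumptions, not from the article:

```python
# Minimal sketch of the "Load DAG -> Modeling DAG" sequencing from the diagram.
# Plain functions stand in for orchestrator tasks; step names are illustrative.
run_log = []

def load_dag():
    """Ingest source data into staging (the 'Load DAG' in the diagram)."""
    for step in ("extract_postgres", "normalize_schema", "load_staging"):
        run_log.append(step)  # a real orchestrator would run tasks here

def modeling_dag():
    """Run dbt-style models layer by layer, only after loading succeeded."""
    for layer in ("staging_models", "core_models", "mart_models"):
        run_log.append(layer)

# The orchestrator's core job: enforce the dependency "load before model".
load_dag()
modeling_dag()
print(run_log)
```

The point is simply that modeling must never start before loading completed, which is exactly what the `LoadDag --> ModelingDag` edge in the diagram expresses.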

# Layers

In this overview, we’ll delve into each layer of a complete Data Warehouse (DWH) architecture and explore why this modeling approach is effective:

# Staging Area

Staging Area: This initial layer serves as the landing point for data from various source systems.

- Here, data is stored in its original form, as delivered.
- The structure of the staging tables mirrors the source system interfaces, with no inter-table relationships.
- Each table holds data from the latest delivery, which is cleared before the next batch.
- Think of it like a grocery store’s loading dock, where suppliers (source systems) drop off goods (data) for temporary storage.
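The truncate-and-load pattern described here can be sketched as follows; the table name `stg_address`, its columns, and the sqlite3 backend are illustrative assumptions, not from the article:

```python
# Sketch of a truncate-and-load staging pattern: the table mirrors the source
# interface 1:1 (plus a load_date), and each delivery replaces the previous one.
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stg_address (
        address_id INTEGER,
        street     TEXT,
        city       TEXT,
        load_date  TEXT
    )
""")

def load_staging(batch):
    """Clear the previous delivery, then land the latest batch as-is."""
    conn.execute("DELETE FROM stg_address")  # only the newest delivery is kept
    conn.executemany(
        "INSERT INTO stg_address VALUES (?, ?, ?, ?)",
        [(r["address_id"], r["street"], r["city"], date.today().isoformat())
         for r in batch],
    )
    conn.commit()

load_staging([{"address_id": 1, "street": "Main St 5", "city": "Zurich"}])
load_staging([{"address_id": 1, "street": "Main St 7", "city": "Zurich"}])

# Like the loading dock: only the most recent delivery is on hand.
rows = conn.execute("SELECT street FROM stg_address").fetchall()
print(rows)  # [('Main St 7',)]
```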

# Cleansing Area

Cleansing Area: Prior to integration into the Core, data undergoes cleaning in the Cleansing Area.

- This involves filtering out, correcting, or supplementing faulty data.
- Data from different sources is transformed and unified.
- Similar to the Staging Area, only the most recent data batch is kept here.
- A grocery store analogy would be the preparation area where goods are made ready for sale, undergoing quality control and labeling.
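Filtering, correcting, and unifying might look like the following sketch; the field names, the two hypothetical sources, and the mapping rules are all illustrative assumptions:

```python
# Minimal cleansing sketch: drop faulty rows, correct values, and unify two
# source schemas into one shape before the data moves on to the Core.
raw_batches = {
    "crm":  [{"id": "1", "country": "ch"},   {"id": "",  "country": "de"}],
    "shop": [{"id": "2", "land": "Germany"}, {"id": "3", "land": "??"}],
}

# Correction/unification rule: map every source spelling to an ISO code.
COUNTRY_MAP = {"ch": "CH", "de": "DE", "Germany": "DE", "Switzerland": "CH"}

def cleanse(source, row):
    """Return a unified row, or None if the record is beyond repair."""
    raw_country = row.get("country") or row.get("land")  # unify source schemas
    country = COUNTRY_MAP.get(raw_country)
    if not row.get("id") or country is None:             # filter faulty data
        return None
    return {"id": int(row["id"]), "country": country, "source": source}

cleansed = [
    clean
    for source, rows in raw_batches.items()
    for row in rows
    if (clean := cleanse(source, row)) is not None
]
print(cleansed)
```

The row with a missing `id` and the row with an unmappable country are filtered out; the rest arrive in one unified shape regardless of source.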

# Core

Core: Data from various sources converges in the Core, having passed through the Staging and Cleansing areas, and is stored long-term, often for years.

- The Core’s main function is integrating data from diverse sources and organizing it thematically.
- It’s the go-to source for Data Marts and should generally not be accessed directly by users.
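Long-term storage in the Core is what the SCD2 merge in the diagram above provides: when an attribute changes, the current version is closed and a new one opened. A minimal in-memory sketch of that merge, with the `valid_from`/`valid_to` columns from the diagram but otherwise assumed names:

```python
# Sketch of the SCD2 merge that historizes staging data into the Core:
# close the current version of a changed record, then open a new version.
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # marks the open-ended current version

def merge_scd2(core, staged, load_date):
    """core: list of versioned rows; staged: latest snapshot keyed by id."""
    current = {r["id"]: r for r in core if r["valid_to"] == HIGH_DATE}
    for key, new_attrs in staged.items():
        old = current.get(key)
        if old is not None and old["attrs"] == new_attrs:
            continue                      # unchanged: keep the current version
        if old is not None:
            old["valid_to"] = load_date   # close the outdated version
        core.append({"id": key, "attrs": new_attrs,
                     "valid_from": load_date, "valid_to": HIGH_DATE})
    return core

core = []
core = merge_scd2(core, {1: {"city": "Zurich"}}, date(2023, 1, 1))
core = merge_scd2(core, {1: {"city": "Bern"}},   date(2023, 6, 1))

for row in core:
    print(row["attrs"]["city"], row["valid_from"], row["valid_to"])
```

After the second load, the Zurich version is closed with `valid_to = 2023-06-01` and the Bern version is current, so the full history remains queryable for years.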

# Data Mart

Data Mart: Marts store subsets of Core data, optimized for user queries.

- Each Mart caters to specific applications or user groups, simplifying query complexity and enhancing system usability.
- These can be likened to specialized market stalls in a grocery store, each offering a curated selection of goods.
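The payoff of a mart's star schema is that user queries shrink to a simple join plus aggregate. A toy example using the `fact_car`/`dim_address` names from the diagram (the data and surrogate keys are made up):

```python
# Toy star-schema query against a data mart: one fact table joined to a
# dimension, pre-shaped so end users only write simple joins and aggregates.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_address (address_sk INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_car (car_id INTEGER, address_sk INTEGER, price REAL);

    INSERT INTO dim_address VALUES (1, 'Zurich'), (2, 'Bern');
    INSERT INTO fact_car VALUES (10, 1, 30000), (11, 1, 20000), (12, 2, 25000);
""")

# The typical mart query: fact metrics grouped by a dimension attribute.
rows = conn.execute("""
    SELECT d.city, SUM(f.price) AS total_price
    FROM fact_car f
    JOIN dim_address d ON d.address_sk = f.address_sk
    GROUP BY d.city
    ORDER BY d.city
""").fetchall()
print(rows)  # [('Bern', 25000.0), ('Zurich', 50000.0)]
```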

# Metadata

Metadata: The foundation of the DWH system, metadata is essential for its smooth operation.

- Business metadata includes descriptions of attributes, drill paths, and aggregation rules for front-end applications.
- Technical metadata covers data structures, mapping rules, and ETL control parameters.
- Operational metadata encompasses log tables, error messages, ETL process logs, and more, essentially forming the DWH’s infrastructure.
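Operational metadata in particular is concrete and easy to sketch: a log table that every ETL step writes to. The table and column names below are illustrative assumptions:

```python
# Sketch of operational metadata: an ETL process log that records per-step
# status, row counts, and error messages, so failures are traceable.
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE etl_log (
        run_id     INTEGER,
        step       TEXT,
        status     TEXT,
        rows_moved INTEGER,
        message    TEXT,
        logged_at  TEXT
    )
""")

def log_step(run_id, step, status, rows_moved=0, message=""):
    conn.execute(
        "INSERT INTO etl_log VALUES (?, ?, ?, ?, ?, ?)",
        (run_id, step, status, rows_moved, message, datetime.now().isoformat()),
    )

log_step(1, "stg_address", "ok", rows_moved=1200)
log_step(1, "core_address", "error", message="duplicate business key")

# Operational questions like "which steps failed last run?" become queries:
failed = conn.execute(
    "SELECT step, message FROM etl_log WHERE run_id = 1 AND status = 'error'"
).fetchall()
print(failed)  # [('core_address', 'duplicate business key')]
```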

While not every Data Warehouse adheres strictly to this structure, with some areas merged or renamed, the essential concept is to segment the system for task specialization. This segmentation facilitates data cleaning, integration, historization, and query handling, simplifying the transformation processes between layers.


Origin:
References: Medallion Architecture, One Big Table (OBT), Data Modeling
Created 2023-04-27
