Second Brain: Crafted, Curated, Connected, Compounded on October 2, 21:00
Databricks Medallion Architecture Explained

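Databricks' Medallion Architecture is not an architecture in the strict sense but a data processing pattern with three data stages: bronze, silver, and gold. The bronze layer stores raw data, the silver layer cleans and transforms it, and the gold layer builds analytics models. The pattern is simple and easy to understand, supports incremental ETL, and can be adapted as requirements change. It is regarded as a modern evolution of the traditional data warehouse architecture, suited to data lake environments.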

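🏙️ Bronze layer: stores raw data, uses Table Formats for storage, supports quick Change Data Capture, and provides historical archiving and data lineage tracking.

🌟 Silver layer: contains data cleaned and transformed from the bronze layer, applies "just enough" transformations, provides an "Enterprise view" of key business entities, and supports self-service analytics.

🔮 Gold layer: implements the analytics model, provides consumption-ready, project-specific databases, uses denormalized, read-optimized data models, applies final transformations and data quality rules, and usually follows a star schema.

⏳ Supports incremental ETL and "time travel", allows tables to be recreated from raw data at any time, and supports ACID transactions.

🔄 The flexible architecture can be adapted to specific needs, such as adding more layers, and builds a simple, easy-to-understand data model on object storage and simple distributed files.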

The Databricks Medallion Architecture is not really an architecture but rather an approach or pattern with three data stages: bronze, silver, and gold.


Image source: Jorrit Sandbrink

- The bronze data stage stores the data in its original form from source systems. It uses Table Formats for storage, focuses on quick Change Data Capture, and provides historical archiving and data lineage.
- The silver data stage contains cleaned and transformed data blended from the bronze stage. It applies "just enough" transformations, provides an "Enterprise view" of key business entities, and enables self-service analytics.
- The gold data stage implements the analytics model with consumption-ready, project-specific databases. It uses denormalized, read-optimized data models and applies final transformations and data quality rules. The data model is usually a Star Schema, with Facts (transactional data) and Dimensions (descriptive attributes) typically defined and optimized at this layer.

Data flows through the layers from dirty to clean, normalized to denormalized, and granular to aggregated. The gold layer often represents the final stage of this transformation.
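The flow from dirty to clean and from granular to aggregated can be illustrated with a minimal, library-free sketch. The sample records, the field names, and the helpers `to_silver` and `to_gold` are hypothetical and chosen only for illustration; a real pipeline would run on a table format and an engine such as Spark rather than Python lists.

```python
from collections import defaultdict

# Bronze: raw records exactly as received (hypothetical sample data).
bronze = [
    {"order_id": "1", "customer": " Alice ", "amount": "10.50"},
    {"order_id": "2", "customer": "bob", "amount": "4.00"},
    {"order_id": "2", "customer": "bob", "amount": "4.00"},   # duplicate
    {"order_id": "3", "customer": "Alice", "amount": "n/a"},  # unparsable amount
    {"order_id": "4", "customer": "ALICE", "amount": "2.25"},
]


def to_silver(rows):
    """Clean, type, and deduplicate bronze rows (dirty -> clean)."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows that fail the basic quality rule
        order_id = int(row["order_id"])
        if order_id in seen:
            continue  # keep only the first occurrence of each order
        seen.add(order_id)
        silver.append({
            "order_id": order_id,
            "customer": row["customer"].strip().title(),
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Aggregate silver rows per customer (granular -> aggregated)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Bronze is never modified here: the silver and gold tables are derived views, which is what makes it possible to rerun the transformations later.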

# Benefits and Considerations

- Simple and easy-to-understand data model based on Object-Store and simple distributed files
- Enables incremental ETL
- Allows recreation of tables from raw data at any time
- Supports ACID transactions and time travel capabilities
- Flexible architecture that can be adapted to specific needs (e.g., adding more layers)
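Two of these benefits, incremental ETL and recreating tables from raw data, can be sketched in plain Python. This is an illustration under stated assumptions, not how Delta Lake or any specific table format implements it: the bronze table is assumed to be append-only and ordered by a hypothetical `seq` field, and `clean`, `incremental_load`, and `rebuild` are made-up helper names.

```python
def clean(event):
    """Hypothetical silver transformation: normalize one raw event."""
    return {"seq": event["seq"], "value": event["value"].strip().lower()}


def incremental_load(bronze, silver, watermark):
    """Process only bronze events newer than the watermark.

    Assumes bronze is append-only and ordered by `seq`.
    Returns the new watermark.
    """
    for event in bronze:
        if event["seq"] > watermark:
            silver.append(clean(event))
            watermark = event["seq"]
    return watermark


def rebuild(bronze):
    """Recreate the silver table from scratch using only the raw data."""
    return [clean(event) for event in bronze]


bronze = [{"seq": 1, "value": " A "}, {"seq": 2, "value": "B"}]
silver, watermark = [], 0

# First run picks up events 1 and 2.
watermark = incremental_load(bronze, silver, watermark)

# A new raw event arrives; the second run processes only event 3.
bronze.append({"seq": 3, "value": " C"})
watermark = incremental_load(bronze, silver, watermark)

# The incrementally built table matches a full rebuild from bronze.
print(silver == rebuild(bronze))
```

Because the raw data is kept unchanged in bronze, a bug in `clean` can be fixed and the downstream table rebuilt without going back to the source systems.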

# Historical Context

The Medallion Architecture, created and announced by Databricks, can be seen as an evolution of the Classical Architecture of Data Warehouse, with its layers such as stage -> cleansing -> core -> mart, but optimized for Data Lakes and modern data processing needs.

Simon Whiteley has another great overview that combines the two and argues that every company has different requirements and, therefore, different layers. A layer of the medallion architecture does not have to consist of only one layer; as shown in the image, it can contain multiple.

https://www.ssp.sh/brain/Medallion%20Architecture-20241009223059518.webp
Image source: Behind the Hype - The Medallion Architecture Doesn’t Work (YouTube)

# Mermaid Platinum Layer Modeling

graph LR
    %% Input sources
    B[Batch] --> Bronze
    S[Streaming] --> Bronze

    %% Main flow
    subgraph Bronze[Bronze Layer]
        B1[Raw Integration<br>Landing zone<br>No schema needed]
    end
    subgraph Silver[Silver Layer]
        S1[Filtered, Cleaned<br>Augmented<br>Define & evolve schema]
    end
    subgraph Gold[Gold Layer]
        G1[Business-oriented<br>Denormalized<br>Clean data delivery]
    end
    subgraph Platinum[Platinum Layer]
        P1[Semantic Layer<br>Aggregated<br>Sub-seconds]
    end

    %% Connections between layers
    Bronze -->|cleaning| Silver
    Silver -->|organize| Gold
    Gold -->|curate| Platinum

    %% Output connections
    Platinum --> Excel[Excel]
    Platinum --> BI[BI]
    Platinum --> ML[ML/AI]
    Platinum --> Apps[Data Apps]

    %% Styling
    classDef default fill:#f9f9f9,stroke:#333,stroke-width:1px
    classDef platinum fill:#5f9ea0,color:white
    class Platinum platinum


# Implementation

Databricks provides tools like Delta Live Tables (DLT) that allow users to build data pipelines with Bronze, Silver, and Gold tables using minimal code. These pipelines can be built on Apache Spark Structured Streaming for real-time data processing.

# Variations

As with the Classical Architecture of Data Warehouse, Medallion Architectures can vary in their layers.

- Sometimes you have two, three, or at most four layers, depending on how much cleaning is needed and how complex your data sources are
- Some tables might not fit perfectly into a single layer
- Adaptable to specific organizational needs and data complexities
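One way to keep the number of layers flexible is to treat the pipeline as an ordered list of named stages, so that adding a fourth layer is a one-line change. The sketch below is a hypothetical illustration: `run_pipeline`, the stage functions, and the sample data are assumptions made for this example and are not part of any Databricks API.

```python
def run_pipeline(raw_rows, stages):
    """Apply each (name, transform) stage in order, keeping every intermediate table."""
    tables = {"bronze": raw_rows}
    current = raw_rows
    for name, transform in stages:
        current = transform(current)
        tables[name] = current
    return tables


def drop_nulls(rows):
    """Silver step: remove missing readings."""
    return [r for r in rows if r is not None]


def summarize(rows):
    """Platinum step: pre-aggregate for fast serving."""
    return {"count": len(rows), "max": max(rows)}


raw = [3, None, 1, 2]  # hypothetical raw readings

# A three-layer setup: bronze -> silver -> gold.
three_layer = [("silver", drop_nulls), ("gold", sorted)]

# A four-layer variant simply appends another stage.
four_layer = three_layer + [("platinum", summarize)]

print(list(run_pipeline(raw, three_layer)))
print(run_pipeline(raw, four_layer)["platinum"])
```

Keeping each intermediate table under its layer name mirrors the pattern's idea that every stage remains queryable, not just the final output.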

Origin: Iceberg + Spark + Trino + Dagster: modern, open-source data stack demo | by ZD | Jul, 2022 | Dev Genius
References: Trivadis Data Warehouse Layers.
Created 2022-08-16
