Second Brain: Crafted, Curated, Connected, Compounded (October 2)
Delta Lake: Empowering the Data Lake with a Lakehouse Architecture

 

Delta Lake is an open-source project that brings a Lakehouse architecture to data lakes. On top of existing data lakes such as S3, ADLS, GCS, and HDFS, it provides ACID transactions, scalable metadata handling, and unified streaming and batch data processing. Delta Lake enhances the Parquet file format with a transaction log that records every operation and allows rolling back to earlier versions. Its core features include schema validation to ensure writes are compatible with the table, Change Data Feed (CDF) to track inserts, updates, and deletes, and manifest-file support for tools such as Trino. The Delta Lake project is governed by the Linux Foundation to keep it vendor-independent, and it is widely supported across data services on AWS, GCP, and Azure.

🌟 **Foundation of the Lakehouse architecture**: Delta Lake's core value lies in turning a data lake into a Lakehouse: by adding ACID transactions, scalable metadata management, and unified stream and batch processing on top of existing data lakes (such as S3, ADLS, GCS), it brings reliability and performance to the data lake.

🗄️ **Enhanced storage format with a transaction log**: Delta Lake builds on the standard Parquet file format and adds a transaction log (_delta_log). This log records every change made to a table, which enables ACID transactions and makes it possible to roll data back to any earlier point in time, greatly improving the reliability of data management.

✅ **Schema validation and evolution**: On write, Delta Lake automatically checks that the DataFrame schema is compatible with the target table, ensuring all columns exist and data types match. It also supports explicitly adding columns and automatic schema updates, improving data quality and processing efficiency.

🔄 **Change Data Feed (CDF) for tracking changes**: Delta Lake's Change Data Feed tracks row-level insert, update, and delete events. This is essential for ETL/ELT pipelines that react to data changes, for transmitting changes to downstream systems, and for building detailed audit trails.

🌐 **Vendor independence and broad support**: Delta Lake is governed by the Linux Foundation, which guarantees its independence and avoids vendor lock-in. It is widely supported by the major cloud providers (AWS, Azure, GCP) and by platforms such as Snowflake and Microsoft Fabric, making it a flexible choice for cross-platform data architectures.

Delta Lake is an open source project that enables building a Lakehouse architecture on top of data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS.

Some references:

# Architecture and Storage

Databricks stores files in distributed object storage such as AWS S3, Google Cloud Storage, or Azure Blob Storage in an open-source file format: Apache Parquet. Parquet is the de facto standard for storing column-oriented, compressed data. The encoded db-schema and the availability of implementations in multiple languages make it a great choice for storing data in big data use cases.

The difference between the Delta Lake table format and a native Parquet file is the addition of the Transaction Log (Delta Lake). Furthermore, Delta Lake unifies different Parquet files with the same db-schema into a single folder, essentially the path to a Delta table. Inside that folder is an automatically generated _delta_log folder, which logs the ACID transactions happening on the table and allows neat features such as rolling back the data to a previous point in time.
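To make this concrete, here is a minimal PySpark sketch (the S3 path and sample data are made up) that writes to a Delta table twice and then reads an earlier version back via the transaction log:

```python
# Minimal sketch, assuming the delta-spark package is installed; the path and data are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-time-travel")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3://my-bucket/delta/tutorial_cities"  # hypothetical table location

# Each write adds a new JSON entry under <path>/_delta_log/.
spark.createDataFrame([(1, "Berlin")], ["id", "city"]) \
    .write.format("delta").mode("overwrite").save(path)
spark.createDataFrame([(2, "Lisbon")], ["id", "city"]) \
    .write.format("delta").mode("append").save(path)

# "Roll back" by reading an earlier version recorded in the transaction log.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```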

# File Format Implementation

See Data Lake File Format.

# Understanding the Sync Process

Let’s look at what files Airbyte created behind the scenes on the S3 data storage, originally from Load Data into Delta Lake on Databricks Lakehouse.

I used CyberDuck to connect to S3, but you can connect via cmd-line or the web console. Below is an image that illustrates how that looks:

An illustration of a Databricks Lakehouse destination on AWS S3

Airbyte created several files. Most notably, the *.snappy.parquet files hold the data, and the _delta_log folder contains the above-mentioned Delta Lake transaction log. The content of your source data resides in the Parquet files, consolidated into a Delta table represented as a single folder for each of the tutorial_cities and tutorial_users tables. Each table folder contains the delta log with detailed information about each transaction.

There are other internal folders that Databricks created, such as copy_into_log and SSTs produced by the COPY INTO command.

At its core, the Airbyte Databricks Lakehouse connector performs the following two steps:

    1. CREATE OR REPLACE TABLE ... AS SELECT * FROM ... LIMIT 0
    2. COPY INTO ...
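For illustration, here is a sketch of what these two steps could look like when issued from a Spark session (for example, in a Databricks notebook). The table name and staging path are hypothetical; the actual statements are generated by the connector.

```python
# Hypothetical sketch of the connector's two steps; table and staging names are made up.

# Step 1: create an empty Delta table with the schema of the staged Parquet files.
spark.sql("""
    CREATE OR REPLACE TABLE tutorial_cities
    AS SELECT * FROM parquet.`s3://my-bucket/airbyte-staging/tutorial_cities/` LIMIT 0
""")

# Step 2: load the staged Parquet files into the Delta table (Databricks SQL COPY INTO).
spark.sql("""
    COPY INTO tutorial_cities
    FROM 's3://my-bucket/airbyte-staging/tutorial_cities/'
    FILEFORMAT = PARQUET
""")
```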

These two transactions are also represented in the transaction log as 00000000000000000000.json/crc and 00000000000000000001.json/crc. If you open the transaction log, represented by the two JSON and CRC files, you should see a protocol entry for each transaction that happened, plus some metadata.

00000000000000000000.json:

    {
        "commitInfo": {
            "timestamp": 1654879959245,
            "operation": "CREATE TABLE AS SELECT",
            "operationParameters": {
                "isManaged": "false",
                "description": "Created from stream tutorial_cities",
                "partitionBy": "[]",
                "properties": "{\"delta.autoOptimize.autoCompact\":\"true\",\"delta.autoOptimize.optimizeWrite\":\"true\",\"airbyte.destinationSyncMode\":\"append\"}"
            },
            "isolationLevel": "WriteSerializable",
            "isBlindAppend": true,
            "operationMetrics": {
                "numFiles": "1",
                "numOutputBytes": "927",
                "numOutputRows": "0"
            }
        }
    }

00000000000000000000.crc:

    {"tableSizeBytes":927,"numFiles":1,"numMetadata":1,"numProtocol":1,"numTransactions":0}

And the second transaction 00000000000000000001.json with COPY INTO:

    {
        "commitInfo": {
            "timestamp": 1654879970063,
            "operation": "COPY INTO",
            "operationParameters": {},
            "readVersion": 0,
            "isolationLevel": "WriteSerializable",
            "isBlindAppend": true,
            "operationMetrics": {
                "numFiles": "1",
                "numOutputBytes": "2085",
                "numOutputRows": "3"
            }
        }
    }

00000000000000000001.crc:

    {"tableSizeBytes":3012,"numFiles":2,"numMetadata":1,"numProtocol":1,"numTransactions":1}

# Key Features

General features are covered in Data Lake Table Format.

# Schema Validation

Delta Lake automatically validates DataFrame schema compatibility with table schema during writes, following these rules:

- All DataFrame columns must exist in the target table
- DataFrame column data types must match the target table types
- Column names cannot differ only by case

Delta Lake on Databricks supports DDL for explicit column addition and automatic schema updates. When using append mode with partitionBy, Delta Lake validates matching configurations and throws errors for mismatches. Without partitionBy, it automatically appends to existing data partitioning.

Source: What is Delta Lake? | Databricks on AWS

Delta Lake employs schema validation on write, ensuring all new writes are schema-compatible before committing. Incompatible schemas trigger transaction cancellation and raise exceptions. The compatibility rules require that the DataFrame:

- Cannot contain additional columns that are absent from the target table
- Must have matching column data types
- Cannot contain column names that differ only by case

Source: Schema Evolution & Enforcement on Delta Lake - Databricks
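A short sketch of both behaviours, assuming a hypothetical tutorial_cities table that has only id and city columns: schema enforcement rejects an incompatible write, and the schema can then be evolved either explicitly via DDL or with the mergeSchema write option.

```python
# Sketch of schema enforcement and evolution; the table name and columns are hypothetical.

# A DataFrame with an extra "country" column that the target table does not have.
df_extra = spark.createDataFrame([(1, "Berlin", "DE")], ["id", "city", "country"])

# Schema enforcement: the incompatible write is cancelled and raises an exception.
try:
    df_extra.write.format("delta").mode("append").saveAsTable("tutorial_cities")
except Exception as err:
    print(f"Write rejected: {err}")

# Option 1: evolve the schema explicitly with DDL ...
spark.sql("ALTER TABLE tutorial_cities ADD COLUMNS (country STRING)")

# Option 2: ... or let Delta add the new column as part of the write.
(df_extra.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("tutorial_cities"))
```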

# Change Data Feed (CDF)

The Change Data Feed (CDF) feature allows Delta tables to track row-level changes between versions. When enabled on a Delta table, the runtime records "change events" for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated.

You can read the change events in batch queries using the DataFrame APIs (that is, spark.read) and in streaming queries (that is, spark.readStream).
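A minimal sketch, reusing the hypothetical tutorial_cities table from above: CDF is enabled via a table property, after which the change events can be read in batch or as a stream (the extra _change_type, _commit_version, and _commit_timestamp columns describe each change).

```python
# Sketch: enable CDF and read the change feed; the table name is hypothetical.

# Enable the change data feed; changes are recorded from this version onward.
spark.sql("""
    ALTER TABLE tutorial_cities
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Batch read of the change events starting at a given table version.
changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 2)   # should be >= the version where CDF was enabled
    .table("tutorial_cities"))
changes.show()

# Streaming read of the same change feed.
stream = (spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 2)
    .table("tutorial_cities"))
```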

# Manifest Files

For integration with Trino and similar tools, Delta Lake typically works with a manifest file. See Presto, Trino, and Athena to Delta Lake integration using manifests.
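A short sketch of generating that manifest with the DeltaTable API (the table path is hypothetical):

```python
# Sketch: generate the symlink manifest that Presto/Trino/Athena can consume.
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "s3://my-bucket/delta/tutorial_cities")
delta_table.generate("symlink_format_manifest")
# Manifest files are written under <table path>/_symlink_format_manifest/.
```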

# Use cases

Change Data Feed is not enabled by default. The following use cases should drive your decision to enable the change data feed.

- Silver and Gold tables: Improve Delta performance by processing only row-level changes following initial MERGE, UPDATE, or DELETE operations to accelerate and simplify ETL and ELT operations.
- Transmit changes: Send a change data feed to downstream systems such as Kafka or an RDBMS that can use it to incrementally process the later stages of data pipelines.
- Audit trail table: Capturing the change data feed as a Delta table provides perpetual storage and efficient query capability to see all changes over time, including when deletes occur and what updates were made.

# History

# Databricks Acquires Tabular

2024-06-04 Ali Ghodsi announced the acquisition of Tabular, the company behind Apache Iceberg.

origin

# Version Updates

# FAQ

![](https://www.ssp.sh/brain/Delta%20Lake-1739702716628.webp)
![](https://www.ssp.sh/brain/Delta%20Lake-1739702692242.webp)

# Vendor Independence

Delta Lake maintains independence through Linux Foundation governance and broad ecosystem support:

Delta is not only used by Databricks, and there is no vendor lock-in, as the Delta Lake project is managed by the Linux Foundation. Delta Lake is supported on AWS Lake Formation, Athena, Redshift (even long before support for Iceberg was added there), and EMR. Snowflake also has some support for Delta Lake, as does GCP BigQuery. Azure ADF and Synapse support Delta Lake as well, and Microsoft Fabric is built on top of Delta Lake. Many other tools and services support Delta. So I don't think you need Databricks, or are vendor-locked there as you're suggesting, if you go the Delta way. (Source: recap of the Iceberg Summit 2024 conference, r/dataengineering)


Origin: Data Lake / Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi) | ssp.sh
References: Medallion Architecture
Created: 2019, at the Spark & AI Summit Amsterdam 2019
