Iceberg is a high-performance Data Lake Table Format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data while making it possible for engines to safely work with the same tables, at the same time.
The project was originally developed at Netflix to solve long-standing issues with their usage of huge, petabyte-scale tables. It was open-sourced in 2018 as an Apache Incubator project and graduated from the incubator on 2020-05-19. The first public commit was on 2017-12-19. More on the story in RW A Short Introduction to Apache Iceberg by Christine Mathiesen (Expedia Group Technology, Medium).
Like Delta Lake, Iceberg also maintains manifest files.
Made for the foundation layer of Data Lakes.
# Features
- Expressive SQL — Iceberg supports flexible SQL commands to merge new data, update existing rows, and perform targeted deletes. Iceberg can eagerly rewrite data files for read performance, or it can use delete deltas for faster updates.
- Full schema evolution — Schema evolution just works. Adding a column won’t bring back “zombie” data. Columns can be renamed and reordered. Best of all, schema changes never require rewriting the table.
- Hidden partitioning — Iceberg handles the tedious and error-prone task of producing partition values for rows in a table and skips unnecessary partitions and files automatically. No extra filters are needed for fast queries, and the table layout can be updated as data or queries change.
- Time travel and rollback — Time travel enables reproducible queries that use exactly the same table snapshot, or lets users easily examine changes. Version rollback allows users to quickly correct problems by resetting tables to a good state.
- Data compaction — Supported out of the box; you can choose from different rewrite strategies such as bin-packing or sorting to optimize file layout and size.
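To make hidden partitioning concrete, here is a minimal, hypothetical sketch (plain Python, file names made up): the table derives a partition value from row data via a declared transform (here Iceberg's `day` transform, days since the Unix epoch), and the planner skips files whose recorded partition value cannot match the query filter — no explicit partition column in the user's query.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def day_transform(ts: datetime) -> int:
    """Iceberg-style day transform: days since the Unix epoch."""
    return (ts - EPOCH).days

# Data files with the partition value recorded at write time (illustrative).
files = [
    {"path": "a.parquet", "day": day_transform(datetime(2022, 8, 10, 5, tzinfo=timezone.utc))},
    {"path": "b.parquet", "day": day_transform(datetime(2022, 8, 11, 9, tzinfo=timezone.utc))},
]

def plan(files, wanted: datetime):
    """Skip files whose partition value cannot satisfy the filter."""
    target = day_transform(wanted)
    return [f["path"] for f in files if f["day"] == target]

print(plan(files, datetime(2022, 8, 11, tzinfo=timezone.utc)))  # ['b.parquet']
```

The user filters on a timestamp; the mapping from timestamp to partition value stays inside the table metadata, which is why it is called "hidden" partitioning.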
All of these features exist in Delta Lake as well, and Apache Hudi offers similar capabilities.
Transaction Log (Open Table Formats) of an Iceberg table, simplified from the Cloudflare blog:
- schemas: Iceberg tracks schema changes over time. Engines use schema information to safely read and write data without needing to rewrite underlying files.
- snapshots: Each snapshot references a specific set of data files that represent the state of the table at a point in time. This enables features like time travel.
- partition-specs: These define how the table is logically partitioned. Query engines leverage this information during planning to skip unnecessary partitions, greatly improving query performance.

This is very similar to Transaction Log (Delta Lake).
Metadata layer:
- Metadata file — Stores information about the table schema, partition information, and the snapshot details for the table.
- Manifest list file — Stores information about all the manifest files that are part of a snapshot, acting as an index for them, along with additional details such as how many data files were added or deleted and the partition boundaries.
- A manifest is an immutable Avro file that lists data files or delete files, along with each file’s partition data tuple, metrics, and tracking information. One or more manifest files are used to store a snapshot, which tracks all of the files in a table at some point in time. Manifests are tracked by a manifest list for each table snapshot. Source
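The pointer chain described above (metadata file → manifest list → manifests → data files) can be sketched as follows; this is a toy model with made-up file names, not the Avro on-disk format:

```python
# One snapshot's manifest list references its manifests (illustrative names).
manifest_list = ["m1.avro", "m2.avro"]

# Each manifest lists data files with their partition data tuple.
manifests = {
    "m1.avro": [{"path": "d1.parquet", "day": 19214}],
    "m2.avro": [{"path": "d2.parquet", "day": 19215}],
}

def data_files_for_snapshot(manifest_list, manifests, day=None):
    """Walk manifest list -> manifests, optionally pruning on partition data."""
    out = []
    for m in manifest_list:
        for entry in manifests[m]:
            if day is None or entry["day"] == day:
                out.append(entry["path"])
    return out

print(data_files_for_snapshot(manifest_list, manifests))         # ['d1.parquet', 'd2.parquet']
print(data_files_for_snapshot(manifest_list, manifests, 19215))  # ['d2.parquet']
```

Because manifests carry partition data and per-file metrics, an engine can prune whole manifests and files during planning without opening the data files themselves.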
Data Layer:
- Data files: Files in Parquet, ORC, or Avro format that store the actual data for the Iceberg table.
# History
Adoption has been pushed by Dremio and Snowflake (with its Iceberg Tables feature). Google Cloud announced support as well on 2023-06-20.
Tabular (Iceberg) is an independent data platform built by the original creators of Apache Iceberg. Tabular addresses the pain data engineers and data scientists endure fighting the shortcomings of their data infrastructure. Tabular was founded by Netflix alumni Ryan Blue, Dan Weeks, and Jason Reid. Acquired by Databricks in 2024.
# Others
# Icebird
# Further Reads
Origin: Data Lake Table Format
References: Apache Iceberg
Created 2022-08-11
