Second Brain: Crafted, Curated, Connected, Compounded on October 2
An Introduction to the Iceberg Data Lake Table Format
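Iceberg is a high-performance data lake table format designed for large analytic tables. It was developed at Netflix, open-sourced in 2018 as an Apache Incubator project, and graduated in 2020. Iceberg offers SQL reliability, schema evolution, hidden partitioning, time travel and rollback, and data compaction, which makes big data processing simpler and more efficient. It records changes in table state through a transaction log made up of metadata files, manifest lists, and manifest files, and it stores data files in Parquet, ORC, or Avro format. Iceberg has become a standard supported by platforms such as Databricks and aims to solve the management problems of data lakes.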

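💡 Iceberg is a high-performance data lake table format developed and open-sourced by Netflix. Its key features are SQL reliability, schema evolution, hidden partitioning, time travel and rollback, and data compaction, which make big data processing simpler and more efficient.

📂 Iceberg records changes in table state through a transaction log. The log consists of metadata files (schema, partitioning, and snapshot information), manifest lists (an index over manifest files), and manifest files (which list data files or delete files together with partition tuples and metrics). Together they provide data versioning and query optimization.

⏳ Iceberg supports time travel: users can query the table snapshot from a specific point in time. Version rollback quickly resets a table to a good state to correct problems.

🔄 Iceberg implements full schema evolution. Adding or renaming columns does not bring back "zombie" data, columns can be reordered, and schema changes do not require rewriting the underlying files, which simplifies data management.

🚀 Iceberg uses hidden partitioning: it produces partition values automatically and skips unnecessary partitions and files, so queries are fast without extra filters, and the table layout can be updated as data or queries change.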

Iceberg is a high-performance Data Lake Table Format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data while making it possible for engines to safely work with the same tables, at the same time.

The project was originally developed at Netflix to solve long-standing issues with their usage of huge, petabyte-scale tables. It was open-sourced in 2018 as an Apache Incubator project and graduated from the incubator on 19 May 2020. The first public commit was on 2017-12-19. More on the story in "A Short Introduction to Apache Iceberg" by Christine Mathiesen (Expedia Group Technology, Medium).

Like Delta Lake, Iceberg also keeps manifest files.

It is made to be the foundation layer of data lakes.

# Features

    Expressive SQL
      Iceberg supports flexible SQL commands to merge new data, update existing rows, and perform targeted deletes. Iceberg can eagerly rewrite data files for read performance, or it can use delete deltas for faster updates.
    Full Schema Evolution
      Schema evolution just works. Adding a column won’t bring back “zombie” data. Columns can be renamed and reordered. Best of all, schema changes never require rewriting your table.
    Hidden Partitioning
      Iceberg handles the tedious and error-prone task of producing partition values for rows in a table and skips unnecessary partitions and files automatically. No extra filters are needed for fast queries, and table layout can be updated as data or queries change.
    Time Travel and Rollback
      Time-travel enables reproducible queries that use exactly the same table snapshot or lets users easily examine changes. Version rollback allows users to quickly correct problems by resetting tables to a good state.
    Data Compaction
      Data compaction is supported out-of-the-box and you can choose from different rewrite strategies such as bin-packing or sorting to optimize file layout and size.
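
To make hidden partitioning more concrete, here is a minimal sketch in plain Python. It is not the Iceberg API: `day_transform`, `DataFile`, `write_rows`, `plan_scan`, the `event_ts` column, and the file paths are all hypothetical names chosen for illustration. The idea it models is that the partition value is derived from a regular column at write time, and a filter on that same column is translated into a partition filter at planning time, so readers never reference a partition column directly.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone
from typing import Dict, List


def day_transform(ts: datetime) -> date:
    """Derive a partition value from a timestamp (simplified 'day' transform)."""
    return ts.astimezone(timezone.utc).date()


@dataclass(frozen=True)
class DataFile:
    path: str        # hypothetical file path
    partition: date  # partition value tracked in table metadata


def write_rows(rows: List[Dict]) -> List[DataFile]:
    """Writers pass plain rows; the partition value is derived, never supplied."""
    days = sorted({day_transform(row["event_ts"]) for row in rows})
    return [DataFile(f"data/event_ts_day={d.isoformat()}/part-0.parquet", d) for d in days]


def plan_scan(files: List[DataFile], ts_from: datetime, ts_to: datetime) -> List[DataFile]:
    """Translate a filter on event_ts into a partition filter and skip other files."""
    first_day, last_day = day_transform(ts_from), day_transform(ts_to)
    return [f for f in files if first_day <= f.partition <= last_day]


def utc(year: int, month: int, day: int, hour: int = 0) -> datetime:
    """Small helper to build timezone-aware UTC timestamps for the example."""
    return datetime(year, month, day, hour, tzinfo=timezone.utc)


rows = [
    {"id": 1, "event_ts": utc(2024, 1, 1, 10)},
    {"id": 2, "event_ts": utc(2024, 1, 1, 23)},
    {"id": 3, "event_ts": utc(2024, 1, 2, 8)},
    {"id": 4, "event_ts": utc(2024, 1, 5, 12)},
]
files = write_rows(rows)

# The query filters only on event_ts; pruning happens behind the scenes.
selected = plan_scan(files, utc(2024, 1, 2), utc(2024, 1, 2, 23))
print([f.path for f in selected])
```

Because the transform lives in table metadata rather than in every query, the layout could be changed later (say, from daily to hourly) without touching the queries; this sketch models only a single fixed transform.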

All of these features exist in Delta Lake as well, and Apache Hudi offers similar ones.

The Transaction Log (Open Table Formats) of an Iceberg table, simplified from the Cloudflare blog:

{
  "format-version": 2,
  "table-uuid": "0195e49b-8f7c-7933-8b43-d2902c72720a",
  "location": "s3://my-bucket/warehouse/0195e49b-79ca/table",
  "current-schema-id": 0,
  "schemas": [
    {
      "schema-id": 0,
      "type": "struct",
      "fields": [
        { "id": 1, "name": "id", "required": false, "type": "long" },
        { "id": 2, "name": "data", "required": false, "type": "string" }
      ]
    }
  ],
  "current-snapshot-id": 3567362634015106507,
  "snapshots": [
    {
      "snapshot-id": 3567362634015106507,
      "sequence-number": 1,
      "timestamp-ms": 1743297158403,
      "manifest-list": "s3://my-bucket/warehouse/0195e49b-79ca/table/metadata/snap-3567362634015106507-0.avro",
      "summary": {},
      "schema-id": 0
    }
  ],
  "partition-specs": [{ "spec-id": 0, "fields": [] }]
}
    schemas: Iceberg tracks schema changes over time. Engines use schema information to safely read and write data without needing to rewrite underlying files.
    snapshots: Each snapshot references a specific set of data files that represent the state of the table at a point in time. This enables features like time travel.
    partition-specs: These define how the table is logically partitioned. Query engines leverage this information during planning to skip unnecessary partitions, greatly improving query performance.
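
As an illustration of how an engine might use these fields, the following sketch reads a trimmed copy of the metadata above with Python's standard `json` module. The helper names `current_schema` and `snapshot_as_of` are made up for this example and are not part of any Iceberg library; a real engine would go through a catalog and an Iceberg client instead of parsing the file by hand.

```python
import json
from typing import Optional

# Trimmed copy of the example metadata file shown above.
METADATA_JSON = """
{
  "format-version": 2,
  "current-schema-id": 0,
  "schemas": [
    {
      "schema-id": 0,
      "type": "struct",
      "fields": [
        { "id": 1, "name": "id", "required": false, "type": "long" },
        { "id": 2, "name": "data", "required": false, "type": "string" }
      ]
    }
  ],
  "current-snapshot-id": 3567362634015106507,
  "snapshots": [
    {
      "snapshot-id": 3567362634015106507,
      "sequence-number": 1,
      "timestamp-ms": 1743297158403,
      "manifest-list": "s3://my-bucket/warehouse/0195e49b-79ca/table/metadata/snap-3567362634015106507-0.avro",
      "summary": {},
      "schema-id": 0
    }
  ],
  "partition-specs": [{ "spec-id": 0, "fields": [] }]
}
"""


def current_schema(meta: dict) -> dict:
    """Return the schema whose id matches current-schema-id."""
    return next(s for s in meta["schemas"] if s["schema-id"] == meta["current-schema-id"])


def snapshot_as_of(meta: dict, timestamp_ms: int) -> Optional[dict]:
    """Time travel: pick the newest snapshot committed at or before timestamp_ms."""
    candidates = [s for s in meta["snapshots"] if s["timestamp-ms"] <= timestamp_ms]
    return max(candidates, key=lambda s: s["timestamp-ms"], default=None)


meta = json.loads(METADATA_JSON)
print([field["name"] for field in current_schema(meta)["fields"]])

snapshot = snapshot_as_of(meta, 1743297158403)
print(snapshot["manifest-list"] if snapshot else "no snapshot that old")
```

A rollback can be pictured the same way: a new metadata file is written whose `current-snapshot-id` points at an older entry in `snapshots`, and the data files themselves are left alone.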

This is very similar to Transaction Log (Delta Lake).

Metadata layer:

    Metadata file: Stores the table schema, partition information, and the snapshot details for the table.
    Manifest list file: Stores information about all the manifest files that are part of a snapshot and acts as an index over them, along with additional details such as how many data files were added or deleted and the partition boundaries.
      A manifest is an immutable Avro file that lists data files or delete files, along with each file’s partition data tuple, metrics, and tracking information. One or more manifest files are used to store a snapshot, which tracks all of the files in a table at some point in time. Manifests are tracked by a manifest list for each table snapshot. (Source: Apache Iceberg table spec)
    Manifest file: Stores a list of data files (in Parquet, ORC, or Avro format), along with the column-level metrics and stats used for file pruning.
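
The hierarchy above (manifest list, then manifests, then data files) and the role of column-level stats in file pruning can be sketched as follows. This is a simplified in-memory model, not the on-disk Avro layout: the class names, file paths, and the `files_to_read` helper are assumptions made for illustration, and bounds are keyed by column name here for readability.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DataFileEntry:
    path: str
    record_count: int
    lower_bounds: Dict[str, int]  # column name -> smallest value in the file
    upper_bounds: Dict[str, int]  # column name -> largest value in the file


@dataclass(frozen=True)
class Manifest:
    path: str
    entries: Tuple[DataFileEntry, ...]


@dataclass(frozen=True)
class ManifestList:
    snapshot_id: int
    manifests: Tuple[Manifest, ...]


def files_to_read(manifest_list: ManifestList, column: str, value: int) -> List[str]:
    """Return data files that may contain rows where column == value."""
    matches = []
    for manifest in manifest_list.manifests:
        for entry in manifest.entries:
            low = entry.lower_bounds.get(column)
            high = entry.upper_bounds.get(column)
            # Without stats the file cannot be ruled out, so it must be read.
            if low is None or high is None or low <= value <= high:
                matches.append(entry.path)
    return matches


manifest_list = ManifestList(
    snapshot_id=1,
    manifests=(
        Manifest("metadata/m0.avro", (
            DataFileEntry("data/a.parquet", 100, {"id": 1}, {"id": 100}),
            DataFileEntry("data/b.parquet", 100, {"id": 101}, {"id": 200}),
        )),
        Manifest("metadata/m1.avro", (
            DataFileEntry("data/c.parquet", 50, {}, {}),
        )),
    ),
)

# Only files whose [lower, upper] range can contain id = 150 are kept.
print(files_to_read(manifest_list, "id", 150))
```

The point of the design is that pruning decisions are made from metadata alone, before any data file is opened.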

Data Layer:

    Data files: Files that can be in Parquet, ORC, or AVRO format, and which store the actual data for the Iceberg Table.

# History

Iceberg is pushed by Dremio and Snowflake with their Iceberg Tables. Google Cloud announced support as well on 2023-06-20.

Tabular (Iceberg) is the independent data platform built by the original creators of Apache Iceberg. Tabular addresses the pain data engineers and data scientists endure fighting the shortcomings of their data infrastructure. Tabular was founded by Netflix alumni Ryan Blue, Dan Weeks and Jason Reid. It was later acquired by Databricks.

# Others

# Icebird

# Further Reads


Origin: Data Lake Table Format
References: Apache Iceberg
Created 2022-08-11
