Backfilling in Data Engineering: Key Concepts and Common Pitfalls

Second Brain: Crafted, Curated, Connected, Compounded on October 2

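This article takes a close look at the concept of backfilling in data engineering. It cites Christophe Blefari's view that backfilling is what separates a great data engineer from an ordinary one. It defines a backfill as recomputing or updating the historical parts of a data asset, and notes that, given a persistent staging area and pure tasks, the state of the entire data warehouse can in theory be rebuilt from scratch. It describes three main scenarios for backfilling: adding a new data asset (greenfield), changing an existing data asset (brownfield), and recovering from failure. It also lists the risks: choosing the wrong target, which leads to inconsistent data or wasted resources; overloading system resources; cost overruns; and a backfill that is interrupted partway through and is hard to recover. Careful planning and execution are the key to avoiding these problems.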

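📦 **Definition and theoretical basis of backfilling**: A backfill fills in or overwrites historical data in a data asset (a table, a set of images, a set of models, and so on) that is normally updated incrementally. The theoretical basis is that if a data pipeline has a persistent staging area and pure tasks, the state of the data warehouse can be rebuilt entirely from scratch. This makes shorter retention policies on derived tables possible, because historical data can be backfilled on demand.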

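💡 **Core use cases for backfilling**: Backfills mainly apply in three situations: greenfield, where a new data asset has been added to the pipeline; brownfield, where an asset in the pipeline has changed; and recovery, where the data state is restored after a system failure or a data error.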

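⚠️ **Potential risks during a backfill**: A backfill can go wrong in several ways. The target can be chosen badly: missing parts that needed backfilling leaves the data incomplete, while backfilling parts that did not need it wastes resources. Backfilling one data asset without also backfilling the derived assets that depend on it can leave the data in an inconsistent state. The backfill can consume large amounts of memory and compute, overloading the system or starving important workloads. A large backfill can cost far more than expected. And a partial failure midway can make the problem hard to locate and force a restart from the beginning.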
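The risk of forgetting derived assets can be made concrete with a small lineage walk. The sketch below assumes a hypothetical `DOWNSTREAM` mapping (the asset names are invented for illustration) and collects every asset that would need to be considered when one asset is backfilled.

```python
from collections import deque

# Hypothetical lineage: asset -> assets derived directly from it.
DOWNSTREAM = {
    "raw_events": ["daily_sessions"],
    "daily_sessions": ["weekly_report", "churn_features"],
    "churn_features": ["churn_model"],
    "weekly_report": [],
    "churn_model": [],
}

def assets_to_backfill(changed, downstream=DOWNSTREAM):
    """Return `changed` plus everything derived from it, in breadth-first order."""
    seen = [changed]
    queue = deque([changed])
    while queue:
        asset = queue.popleft()
        for child in downstream.get(asset, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen

print(assets_to_backfill("daily_sessions"))
```

Breadth-first order is enough to list the affected assets; a real scheduler would also need a topological order so that each asset is rebuilt only after all of its inputs.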

“Backfilling is where you see the difference between a data engineer and a great data engineer.” — Christophe Blefari on LinkedIn

Given a persistent immutable staging area and pure tasks, in theory it’s possible to recompute the state of the entire warehouse from scratch (not that you should), and get to the exact same state. Knowing this, the retention policy on derived tables can be shorter, knowing that it’s possible to backfill historical data at will.
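As a minimal sketch of that idea, assume a hypothetical in-memory `STAGING` dictionary standing in for the immutable staging area, and a task `daily_revenue` whose output depends only on the staged input for one day. Both names and the sample records are illustrative, not part of any real pipeline.

```python
from collections import defaultdict

# Hypothetical immutable staging area: raw events keyed by day.
# Tuples are used so the staged records cannot be edited in place.
STAGING = {
    "2022-08-24": ({"user": "a", "amount": 5}, {"user": "b", "amount": 7}),
    "2022-08-25": ({"user": "a", "amount": 3},),
}

def daily_revenue(day):
    """Pure task: reads only STAGING[day], has no side effects."""
    totals = defaultdict(int)
    for event in STAGING[day]:
        totals[event["user"]] += event["amount"]
    return dict(totals)

def rebuild_warehouse():
    """Recompute every derived partition from staging alone."""
    return {day: daily_revenue(day) for day in sorted(STAGING)}

# Rebuilding twice yields the same state, so derived data can be dropped
# and recomputed later.
assert rebuild_warehouse() == rebuild_warehouse()
```

Because nothing outside the staged input feeds the task, a derived partition that has aged out of retention can be regenerated by calling the same task again.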

# What’s a backfill?

A backfill is when you take a data asset that’s normally updated incrementally and update historical parts of it.

For example, you have a table, and each day, you add records to it that correspond to events that happened during that day. Backfilling the table means filling in or overwriting data for days in the past.
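The daily-table example can be sketched as follows, assuming the table is modeled as a dictionary of day partitions and that `build_partition` (a hypothetical helper) recomputes one day from source events. Each past day is overwritten as a whole rather than appended to, so running the backfill twice does not duplicate rows.

```python
from datetime import date, timedelta

def days_between(start, end):
    """Yield each day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)

def build_partition(day, source_events):
    """Hypothetical per-day computation: keep the events that happened on `day`."""
    return [e for e in source_events if e["day"] == day]

def backfill(table, source_events, start, end):
    """Overwrite the partitions for [start, end] with freshly computed data."""
    for day in days_between(start, end):
        table[day] = build_partition(day, source_events)  # replace, never append
    return table
```

Overwriting whole partitions is one common way to keep a backfill idempotent; an append-based load would need its own deduplication step.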

We use the term “Data Assets” instead of only “table” because not all pipelines operate on tabular data. You might also backfill a set of images that you use for training a vision model. Or backfill a set of ML models.

# Why backfill your data?

You typically run a backfill if you’re in one of these situations:

Greenfield - you’ve added data assets to your data pipeline

Brownfield - one of the assets in your data pipeline has changed

Recovering from failure

# Backfills gone wrong

Backfills can go wrong in a few different ways:

Targeting the wrong subset

Resource overload

Cost overload

Getting lost in the middle

To avoid these issues, it’s essential to plan and execute backfills carefully.
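One way to guard against resource overload and getting lost in the middle is to process partitions one at a time, record each success, and cap how much a single run may do. The sketch below is an assumption-laden illustration: `run_backfill`, `done`, and `max_per_run` are hypothetical names, and in practice `done` would be persisted (for example in a file or a table) rather than held in memory.

```python
def run_backfill(partitions, process, done, max_per_run=None):
    """Process partitions not yet in `done`, recording each success.

    Stops at the first failure or once `max_per_run` partitions have been
    processed, so a later call resumes where this one left off.
    Returns (status, remaining_partitions).
    """
    processed = 0
    status = "complete"
    for partition in partitions:
        if partition in done:
            continue  # finished in an earlier run, skip it
        if max_per_run is not None and processed >= max_per_run:
            status = "paused"  # cap reached, leave the rest for the next run
            break
        try:
            process(partition)
        except Exception:
            status = "failed"  # stop here; completed work stays recorded
            break
        done.add(partition)
        processed += 1
    remaining = [p for p in partitions if p not in done]
    return status, remaining
```

After a failure, calling `run_backfill` again with the same `done` set retries only the failed partition and the ones after it, so the list of remaining partitions always shows exactly where the backfill stands.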

Origin:
References: Tenacity (Retrying library for Python)
Created 2022-08-26
