Second Brain: Crafted, Curated, Connected, Compounded on October 2
Backfilling in Data Engineering: Key Concepts and Common Pitfalls

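This article explores backfilling in data engineering and cites Christophe Blefari's view that backfilling is what separates a great data engineer from an ordinary one. It defines a backfill as recomputing or updating historical parts of a data asset, and notes that with a persistent staging area and pure tasks, the state of the entire data warehouse can in theory be rebuilt from scratch. It describes three main situations that call for a backfill: new data assets (greenfield), changes to existing data assets (brownfield), and recovery from failure. It also covers the risks: targeting the wrong subset, which leads to inconsistent data or wasted resources; overloading system resources; cost overruns; and interruptions mid-backfill that are hard to recover from. Careful planning and execution are the key to avoiding these problems.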

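📦 **Definition and theoretical basis**: A backfill fills in or overwrites historical data for a data asset that is normally updated incrementally (tables, image sets, models, and so on). The theoretical basis is that a pipeline with a persistent staging area and pure tasks can rebuild the state of the data warehouse entirely from scratch. This makes shorter retention policies on derived tables possible, because historical data can be backfilled on demand.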

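💡 **Core use cases**: Backfills apply mainly in three situations: "greenfield", where new data assets have been added to the pipeline; "brownfield", where an asset in the pipeline has changed; and recovery from a system failure or data error.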

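⚠️ **Potential risks**: A backfill can go wrong in several ways. Poor targeting either misses parts that need backfilling, leaving the data incomplete, or backfills parts that do not need it, wasting resources. Backfilling an asset without also backfilling the assets derived from it can leave the data in an inconsistent state. A backfill can consume large amounts of memory and compute, overloading the system or starving important workloads. A large backfill can cost far more than expected. And a partial failure mid-backfill can make the problem hard to locate and force a restart from the beginning.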

“Backfilling is where you see the difference between a data engineer and a great data engineer.” — Christophe Blefari on LinkedIn

Given a persistent immutable staging area and pure tasks, in theory it’s possible to recompute the state of the entire warehouse from scratch (not that you should), and get to the exact same state. Knowing this, the retention policy on derived tables can be shorter, knowing that it’s possible to backfill historical data at will.

# What’s a backfill?

A backfill is when you take a data asset that’s normally updated incrementally and update historical parts of it.

For example, you have a table, and each day, you add records to it that correspond to events that happened during that day. Backfilling the table means filling in or overwriting data for days in the past.
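The daily-table example can be sketched in a few lines of Python. This is a minimal illustration, not a real pipeline: `STAGING`, `build_partition`, and `backfill` are hypothetical names, the staging area is an in-memory dict, and the "table" is a dict keyed by day. The point it tries to show is that each day's partition is computed only from that day's staged events and is overwritten rather than appended to, so re-running the backfill gives the same result.

```python
from datetime import date, timedelta

# Hypothetical immutable staging area: raw events keyed by the day they happened.
STAGING = {
    date(2022, 8, 1): [{"user": "a", "amount": 5}, {"user": "b", "amount": 7}],
    date(2022, 8, 2): [{"user": "a", "amount": 3}],
    date(2022, 8, 3): [],
}


def build_partition(staging, day):
    """Pure task: the output depends only on the staged events for `day`."""
    events = staging.get(day, [])
    return {
        "day": day,
        "events": len(events),
        "total": sum(event["amount"] for event in events),
    }


def daterange(start, end):
    """Yield each day from `start` to `end`, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


def backfill(table, staging, start, end):
    """Overwrite (never append to) every daily partition in [start, end]."""
    for day in daterange(start, end):
        table[day] = build_partition(staging, day)
    return table


# Fill in three past days of a (hypothetical) daily totals table.
daily_totals = {}
backfill(daily_totals, STAGING, date(2022, 8, 1), date(2022, 8, 3))
```

Because `backfill` replaces whole partitions, running it twice over the same range leaves `daily_totals` unchanged, which is what makes it safe to repeat after a change or a failure.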

We use the term “Data Assets” instead of only “table” because not all pipelines operate on tabular data. You might also backfill a set of images that you use for training a vision model. Or backfill a set of ML models.

# Why backfill your data?

You typically run a backfill if you’re in one of these situations:

- Greenfield - you’ve added data assets to your data pipeline
- Brownfield - one of the assets in your data pipeline has changed
- Recovering from failure

# Backfills gone wrong

Backfills can go wrong in a few different ways:

- Targeting the wrong subset - If you neglect to backfill parts of your data that need to be backfilled, you risk finishing with the false impression that your data is up-to-date. If you backfill parts of your data that don’t need to be backfilled, then you’ve wasted some time and money. If you backfill a data asset without backfilling the data assets derived from it, you risk ending up with your data in an inconsistent and confusing state.
- Resource overload - Backfills can require significant amounts of memory and processing power. This can cause them to overwhelm your system or starve important workloads.
- Cost overload - A large backfill might end up costing much more than expected.
- Getting lost in the middle - If parts of your backfill fail, you can end up in a state where you know something has gone wrong but don’t know what, and need to restart from the beginning.
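The "targeting" problem, in particular forgetting the assets derived from the one you backfilled, can be made concrete with a small lineage walk. The sketch below assumes you can describe your pipeline as a mapping from each asset to the assets derived directly from it; the `DOWNSTREAM` mapping, the asset names, and `assets_to_backfill` are all hypothetical. It returns the changed asset together with everything downstream of it, ordered so that each asset appears before the assets built from it.

```python
# Hypothetical lineage: each asset maps to the assets derived directly from it.
DOWNSTREAM = {
    "raw_events": ["daily_totals"],
    "daily_totals": ["weekly_report", "churn_features"],
    "churn_features": ["churn_model"],
    "weekly_report": [],
    "churn_model": [],
}


def assets_to_backfill(changed, downstream):
    """Return `changed` plus all assets derived from it, parents before children.

    Uses a depth-first search and reverses the post-order, which gives a
    valid topological order as long as the lineage graph has no cycles.
    """
    visited = set()
    postorder = []

    def visit(asset):
        if asset in visited:
            return
        visited.add(asset)
        for child in downstream.get(asset, []):
            visit(child)
        postorder.append(asset)

    visit(changed)
    return postorder[::-1]


# A change to daily_totals also affects the report, the features and the model,
# but not the upstream raw_events.
plan = assets_to_backfill("daily_totals", DOWNSTREAM)
```

Listing the plan before running anything is also a cheap way to notice an unexpectedly large, and therefore expensive, backfill.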

To avoid these issues, it’s essential to plan and execute backfills carefully.
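One way to make a backfill easier to plan and to recover is to record the outcome of every partition and to cap how much work a single run does. The following is a rough sketch under stated assumptions: `run_backfill`, `flaky_task`, and the `status` dict are hypothetical, and the status would need to live somewhere durable in a real system. A failed partition is marked as failed instead of aborting the run, and a later run only touches partitions that are not yet done.

```python
from datetime import date, timedelta


def run_backfill(days, task, status, max_per_run=None):
    """Run `task` for each pending day and record the outcome per partition.

    `status` maps day -> "done" or "failed: <reason>". Days already marked
    "done" are skipped, so the function can be re-run after a failure.
    `max_per_run` is a crude throttle on how many partitions one run handles.
    """
    pending = [day for day in days if status.get(day) != "done"]
    if max_per_run is not None:
        pending = pending[:max_per_run]
    for day in pending:
        try:
            task(day)
            status[day] = "done"
        except Exception as exc:  # keep going; the failure is recorded per day
            status[day] = f"failed: {exc}"
    return status


# Hypothetical task that fails for one day until the underlying issue is fixed.
days = [date(2022, 8, 1) + timedelta(days=i) for i in range(5)]
calls = []
broken = {date(2022, 8, 3)}


def flaky_task(day):
    calls.append(day)
    if day in broken:
        raise RuntimeError("source unavailable")


status = {}
run_backfill(days, flaky_task, status)   # one partition fails, four succeed
broken.clear()                           # the underlying issue is fixed
run_backfill(days, flaky_task, status)   # only the failed partition is retried
```

After the first run the status shows exactly which day failed and why, so there is no need to restart from the beginning. Automatic retries of individual partitions, for instance with a retry library such as the Tenacity library listed in the references, would be a complementary step that is not shown here.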

Read more on Backfills in Data & Machine Learning: A Primer.


Origin:
References: Tenacity (Retrying library for Python)
Created 2022-08-26
