Databricks · October 28, 02:04
Lakeflow Jobs adds backfill runs to simplify data processing

Data engineering teams managing complex data ecosystems regularly face data delays, quality issues, and changing business rules, leaving downstream data stale or inaccurate. The new backfill runs in Lakeflow Jobs offer a simpler, data-aware way to process historical data. Through an intuitive no-code UI, users can configure a backfill date range, choose a granularity, and override parameters, resolving missing data and updated business logic without writing scripts. The feature helps keep datasets complete, accurate, and fresh, improves data engineering productivity, and supports more reliable data and AI orchestration.

📊 **Backfill runs simplify the data processing workflow:** The new backfill runs in Lakeflow Jobs let users process historical data directly through a streamlined no-code UI. This addresses common pain points for data engineering teams dealing with cloud outages, broken data source connections, late-arriving data, or data quality issues, and avoids the tedium and error-proneness of traditional manual backfills.

💡 **Handling complex data challenges:** Using the example of a retail company's order summary report, the article shows how backfill runs resolve missing pricing data caused by a broken data source connection, as well as the need to reprocess historical data after the marketing team introduces a new LTV formula. Backfill runs can therefore address both inaccurate data and changed business logic.

⚙️ **Guided configuration and execution:** By clicking "Run backfill" in the Jobs UI, users configure the date range for the historical data, choose the run granularity, and override specific parameters; Lakeflow Jobs then automatically creates and triggers concurrent runs to process the historical data efficiently. If the job has no date parameter, the system warns the user to add one so the backfill can run.

🚀 **Better data engineering productivity and reliability:** Backfill runs help data engineers move from a reactive, manual approach to data quality issues to a more proactive, data-aware one. This saves significant time spent fixing data quality problems, strengthens the accuracy, freshness, and relevance of downstream analytics, and provides a more solid foundation for data and AI orchestration.

Backfill runs are now generally available in Lakeflow Jobs

Managing complex data ecosystems with numerous sources and constant updates is challenging for data engineering teams. They often face unpredictable but common issues like cloud vendor outages, broken connections to data sources, late-arriving data, or even data quality issues at the source. Other times, they have to deal with sudden business rule changes that impact the entire data orchestration.

The result? Downstream data is stale, inaccurate, or incomplete. While backfilling - rerunning jobs with historical data - is a common need and solution to this, traditional manual and ad hoc backfills are tedious, error-prone, and don't scale, hindering efficient resolution of common data quality issues.

Backfill runs in Lakeflow Jobs are a simpler and more data-aware approach

Imagine you are a data engineer at a retail company responsible for creating a weekly order summary report for the Business Intelligence team. Your report is critical for tracking revenue and customer behavior for dynamic sales generation. This Job is scheduled to run every Monday morning before the new work week begins, and uses the iso_datetime job parameter to timestamp your data.
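For context, here is a minimal sketch of what such a weekly notebook task might look like. The `retail.orders` table and its columns are hypothetical, and `spark` and `dbutils` are Databricks notebook globals; the only job-specific piece is reading the iso_datetime parameter.

```python
# Minimal sketch of the weekly order-summary task (table and column names are hypothetical).
from datetime import datetime, timedelta
from pyspark.sql import functions as F

# Job parameters such as iso_datetime arrive as widgets in a notebook task.
run_ts = datetime.fromisoformat(dbutils.widgets.get("iso_datetime"))
week_start = run_ts - timedelta(days=7)

weekly_summary = (
    spark.table("retail.orders")                      # hypothetical source table
    .where(F.col("order_ts") >= F.lit(week_start))
    .where(F.col("order_ts") < F.lit(run_ts))
    .groupBy("customer_id")
    .agg(
        F.sum("order_total").alias("revenue"),
        F.count("*").alias("order_count"),
    )
    .withColumn("report_ts", F.lit(run_ts))           # timestamp the output with the job parameter
)
```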

One morning, you discover that a broken connection to one of your data sources from the past 3 weeks caused critical pricing data to be omitted, making your entire summary table inaccurate. Simultaneously, the marketing team just introduced a new formula for calculating customer lifetime value (LTV), and they need all historical order data to be reprocessed to reflect this new business logic. This adds a new layer of complexity to your data orchestration that needs to be addressed promptly, given the marketing analytics and strategy needs.

Lakeflow Jobs can resolve both issues using the new backfill runs, which process historical data directly in Lakeflow Jobs, all in a no-code UI. Simply by clicking “Run backfill” in the Jobs UI, you can configure the date and time range for the historical data, choose the granularity at which you want to run the job, and select the parameters you’d like to override for that backfill, without writing a single line of code.

In the image above, 7 backfill runs will be created at one-day intervals, the first on October 9th, 2025 at 10:00 AM, and the last on October 16th, 2025 at 10:00 AM. The parameter "backfill.iso_date" will be passed into each backfill run (e.g. 2025-10-09 10:00:00.000 for the first run). Once “Run” is clicked, 7 concurrent runs will be automatically triggered to backfill the data in your job.
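For comparison, the ad hoc alternative that backfill runs replace is typically a hand-rolled script that loops over dates and triggers the job with an overridden date parameter. A rough sketch, assuming the databricks-sdk Python package and a placeholder job ID:

```python
# For comparison only: roughly the manual scripting that the backfill UI replaces.
from datetime import datetime, timedelta
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()       # authenticates from the environment / .databrickscfg
JOB_ID = 123456789          # placeholder job ID

start = datetime(2025, 10, 9, 10, 0)
for i in range(7):          # one run per one-day interval
    run_ts = start + timedelta(days=i)
    w.jobs.run_now(
        job_id=JOB_ID,
        job_parameters={"iso_datetime": run_ts.isoformat()},  # override the job's date parameter
    )
```

With backfill runs, Lakeflow Jobs generates the equivalent set of runs for you, pre-fills the recommended range, granularity, and parameter overrides, and triggers them concurrently.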

In short, backfill runs in Lakeflow Jobs help you:

  • Ensure that you have the most complete and up-to-date datasets
  • Simplify and accelerate access to historical data with an intuitive, no-code interface
  • Improve data engineering productivity by eliminating the need for manual data searches and backfill processes


How backfill runs fix data gaps and business rule changes

Let’s return to our earlier retail example. You had to deal with outdated pricing data and a new LTV formula, both of which impact your downstream data.

Backfill runs look at recent failures to identify the optimal lookback period and check the schedule of your job to recommend a granularity for the backfill. In this example, three backfill runs (once per week) will be created. When run, the backfill runs add the missing data and reprocess historical orders with the new business logic. This eliminates the need for manual scripts, ensuring an accurate and consistent dataset in one efficient step.

Scheduled jobs normally have date parameters on them to partition the processed data. Lakeflow Jobs' backfill runs leverage these parameters to create backfills. When you create the backfill, Jobs changes the date parameter to the specific past dates and uses that to process historical data. If your job does not have a date parameter, we’ll warn you so you can add the parameter at runtime to execute the backfill. This means that, as long as your data is partitioned by date, backfill runs will work to backfill your data, even if the data in your source tables has changed.
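To illustrate why this works, here is a sketch of a date-partitioned, idempotent write pattern such a job might use, again with hypothetical table and column names. Because each run only replaces the slice for its own date window, a backfill run that re-executes the job with a past date cleanly overwrites that slice instead of appending duplicates.

```python
# Sketch of a date-partitioned, idempotent write (hypothetical names; Databricks notebook globals).
from datetime import datetime, timedelta
from pyspark.sql import functions as F

run_ts = datetime.fromisoformat(dbutils.widgets.get("iso_datetime"))
window_start = run_ts - timedelta(days=7)
week_start = window_start.date()                     # partition value for this run's window

summary = (
    spark.table("retail.orders")
    .where(F.col("order_ts") >= F.lit(window_start))
    .where(F.col("order_ts") < F.lit(run_ts))
    .groupBy("customer_id")
    .agg(F.sum("order_total").alias("revenue"))
    .withColumn("week_start", F.lit(week_start))     # date partition column
)

# Overwrite only this window's partition: rerunning the job with a past iso_datetime
# (which is what a backfill run does) replaces the slice rather than duplicating it.
(summary.write
    .mode("overwrite")
    .partitionBy("week_start")
    .option("replaceWhere", f"week_start = '{week_start}'")
    .saveAsTable("retail.weekly_order_summary"))
```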

Better data and AI orchestration with Lakeflow Jobs

By leveraging the new backfill runs in Lakeflow Jobs, you can move from a reactive, manual data engineering approach to a faster, data-aware one. Backfill runs not only help you save countless hours solving data quality issues but also ensure the integrity and reliability of your data. Now, you can be more confident about the freshness, accuracy, and relevance of your downstream analytics. With Databricks’ native orchestrator for data, analytics, and AI, you can modernize your data engineering strategy and let us do the heavy lifting for you.

Backfill runs are now available to all Databricks customers using Lakeflow Jobs. To get started:
