Databricks · October 28, 02:04
Lakeflow Jobs adds backfill runs to simplify data processing

Data engineering teams managing complex data ecosystems regularly face data delays, quality issues, and changing business rules, leaving downstream data stale or inaccurate. The new backfill runs in Lakeflow Jobs offer a simpler, data-aware way to process historical data. Through an intuitive no-code UI, users can configure a backfill date range, choose a granularity, and override parameters, resolving missing data and updated business logic without writing scripts. The feature helps keep datasets complete, accurate, and fresh, improves data engineering productivity, and supports more reliable data and AI orchestration.

📊 **Backfill runs simplify the data processing workflow:** The new backfill runs in Lakeflow Jobs let users process historical data directly through a streamlined no-code UI. This addresses common pain points for data engineering teams dealing with cloud outages, broken data source connections, late-arriving data, or data quality issues, and avoids the tedium and error-proneness of traditional manual backfills.

💡 **Handling complex data challenges:** Using the example of a retail company's order summary report, the article shows how backfill runs resolve missing pricing data caused by a broken data source connection, as well as the need to reprocess historical data after the marketing team introduces a new LTV formula. Backfill runs can therefore address both inaccurate data and changed business logic.

⚙️ **Guided configuration and execution:** By clicking "Run backfill" in the Jobs UI, users configure the date range for the historical data, choose the run granularity, and override specific parameters; Lakeflow Jobs then automatically creates and triggers concurrent runs to process the historical data efficiently. If the job has no date parameter, the system warns the user to add one so the backfill can run.

🚀 **Better data engineering productivity and reliability:** Backfill runs help data engineers move from a reactive, manual approach to data quality issues to a more proactive, data-aware one. This saves significant time spent fixing data quality problems, strengthens the accuracy, freshness, and relevance of downstream analytics, and provides a more solid foundation for data and AI orchestration.

Backfill runs are now generally available in Lakeflow Jobs

Managing complex data ecosystems with numerous sources and constant updates is challenging for data engineering teams. They often face unpredictable but common issues like cloud vendor outages, broken connections to data sources, late-arriving data, or even data quality issues at the source. Other times, they have to deal with sudden business rule changes that impact the entire data orchestration.

The result? Downstream data is stale, inaccurate, or incomplete. While backfilling - rerunning jobs with historical data - is a common need and solution to this, traditional manual and ad hoc backfills are tedious, error-prone, and don't scale, hindering efficient resolution of common data quality issues.

Backfill runs in Lakeflow Jobs are a simpler and more data-aware approach

Imagine you are a data engineer at a retail company responsible for creating a weekly order summary report for the Business Intelligence team. Your report is critical for tracking revenue and customer behavior for dynamic sales generation. This Job is scheduled to run every Monday morning before the new work week begins, and uses the iso_datetime job parameter to timestamp your data.
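For context, here is a minimal sketch of what such a weekly notebook task might look like. The `retail.orders` table and its columns are hypothetical, and `spark` and `dbutils` are Databricks notebook globals; the only job-specific piece is reading the iso_datetime parameter.

```python
# Minimal sketch of the weekly order-summary task (table and column names are hypothetical).
from datetime import datetime, timedelta
from pyspark.sql import functions as F

# Job parameters such as iso_datetime arrive as widgets in a notebook task.
run_ts = datetime.fromisoformat(dbutils.widgets.get("iso_datetime"))
week_start = run_ts - timedelta(days=7)

weekly_summary = (
    spark.table("retail.orders")                      # hypothetical source table
    .where(F.col("order_ts") >= F.lit(week_start))
    .where(F.col("order_ts") < F.lit(run_ts))
    .groupBy("customer_id")
    .agg(
        F.sum("order_total").alias("revenue"),
        F.count("*").alias("order_count"),
    )
    .withColumn("report_ts", F.lit(run_ts))           # timestamp the output with the job parameter
)
```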

One morning, you discover that a broken connection to one of your data sources from the past 3 weeks caused critical pricing data to be omitted, making your entire summary table inaccurate. Simultaneously, the marketing team just introduced a new formula for calculating customer lifetime value (LTV), and they need all historical order data to be reprocessed to reflect this new business logic. This adds a new layer of complexity to your data orchestration that needs to be addressed promptly, given the marketing analytics and strategy needs.

Lakeflow Jobs can resolve both issues using the new backfill runs, which process historical data directly in Lakeflow Jobs, all in a no-code UI. Simply by clicking “Run backfill” in the Jobs UI, you can configure the date and time range for the historical data, choose the granularity at which you want to run the job, and select the parameters you’d like to override for that backfill, without writing a single line of code.

In the image above, 7 backfill runs will be created at one-day intervals, the first on October 9th, 2025 at 10:00 AM, and the last on October 16th, 2025 at 10:00 AM. The parameter "backfill.iso_date" will be passed into each backfill run (e.g. 2025-10-09 10:00:00.000 for the first run). Once “Run” is clicked, 7 concurrent runs will be automatically triggered to backfill the data in your job.
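For comparison, the ad hoc alternative that backfill runs replace is typically a hand-rolled script that loops over dates and triggers the job with an overridden date parameter. A rough sketch, assuming the databricks-sdk Python package and a placeholder job ID:

```python
# For comparison only: roughly the manual scripting that the backfill UI replaces.
from datetime import datetime, timedelta
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()       # authenticates from the environment / .databrickscfg
JOB_ID = 123456789          # placeholder job ID

start = datetime(2025, 10, 9, 10, 0)
for i in range(7):          # one run per one-day interval
    run_ts = start + timedelta(days=i)
    w.jobs.run_now(
        job_id=JOB_ID,
        job_parameters={"iso_datetime": run_ts.isoformat()},  # override the job's date parameter
    )
```

With backfill runs, Lakeflow Jobs generates the equivalent set of runs for you, pre-fills the recommended range, granularity, and parameter overrides, and triggers them concurrently.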

In short, backfill runs in Lakeflow Jobs help you:

  • Ensure that you have the most complete and up-to-date datasets
  • Simplify and accelerate access to historical data with an intuitive, no-code interface
  • Improve data engineering productivity by eliminating the need for manual data searches and backfill processes


How backfill runs fix data gaps and business rule changes

Let’s return to our earlier retail example. You had to deal with outdated pricing data and a new LTV formula, both of which impact your downstream data.

Backfill runs look at recent failures to identify the optimal lookback period and check the schedule of your job to recommend a granularity for the backfill. In this example, three backfill runs (once per week) will be created. When run, the backfill runs add the missing data and reprocess historical orders with the new business logic. This eliminates the need for manual scripts, ensuring an accurate and consistent dataset in one efficient step.

Scheduled jobs normally have date parameters on them to partition the processed data. Lakeflow Jobs' backfill runs leverage these parameters to create backfills. When you create the backfill, Jobs changes the date parameter to the specific past dates and uses that to process historical data. If your job does not have a date parameter, we’ll warn you so you can add the parameter at runtime to execute the backfill. This means that, as long as your data is partitioned by date, backfill runs will work to backfill your data, even if the data in your source tables has changed.
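To illustrate why this works, here is a sketch of a date-partitioned, idempotent write pattern such a job might use, again with hypothetical table and column names. Because each run only replaces the slice for its own date window, a backfill run that re-executes the job with a past date cleanly overwrites that slice instead of appending duplicates.

```python
# Sketch of a date-partitioned, idempotent write (hypothetical names; Databricks notebook globals).
from datetime import datetime, timedelta
from pyspark.sql import functions as F

run_ts = datetime.fromisoformat(dbutils.widgets.get("iso_datetime"))
window_start = run_ts - timedelta(days=7)
week_start = window_start.date()                     # partition value for this run's window

summary = (
    spark.table("retail.orders")
    .where(F.col("order_ts") >= F.lit(window_start))
    .where(F.col("order_ts") < F.lit(run_ts))
    .groupBy("customer_id")
    .agg(F.sum("order_total").alias("revenue"))
    .withColumn("week_start", F.lit(week_start))     # date partition column
)

# Overwrite only this window's partition: rerunning the job with a past iso_datetime
# (which is what a backfill run does) replaces the slice rather than duplicating it.
(summary.write
    .mode("overwrite")
    .partitionBy("week_start")
    .option("replaceWhere", f"week_start = '{week_start}'")
    .saveAsTable("retail.weekly_order_summary"))
```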

Better data and AI orchestration with Lakeflow Jobs

By leveraging the new backfill runs in Lakeflow Jobs, you can move from a reactive, manual data engineering approach to a faster, data-aware one. Backfill runs not only help you save countless hours solving data quality issues but also ensure the integrity and reliability of your data. Now, you can be more confident about the freshness, accuracy, and relevance of your downstream analytics. With Databricks’ native orchestrator for data, analytics, and AI, you can modernize your data engineering strategy and let us do the heavy lifting for you.

Backfill runs are now available to all Databricks customers using Lakeflow Jobs. To get started:
