The Role and Evolution of Data Orchestration Tools

Data orchestration tools play a key role in making different technologies and formats work together, taking care of preparing, moving, and wrangling data. They consolidate data from multiple storage locations into the orchestrator for unified management. Over time, orchestrators have evolved from command-line tools into modern declarative platforms that emphasize data assets, software-defined assets, and engineering best practices such as version control and CI/CD to tame data complexity.

📌 The core function of data orchestration tools is to schedule and monitor workflows: they consolidate scattered data onto a unified platform, ensure data is processed correctly in the right place, and let different technologies and file formats work together.

🚀 Orchestrators evolved from the command-line tool cron in 1987, to graphical ETL tools such as Oracle OWB and SSIS around 2000, to Python orchestrators such as Apache Oozie, Luigi, and Apache Airflow in the early 2010s, to modern orchestrators such as Prefect, Kedro, Dagster, and Temporal around 2019, and finally to enterprise-grade and universal orchestrators such as Kestra and Mage in 2022.

🔄 Modern data orchestration emphasizes data awareness and declarative approaches, using data assets and software-defined assets to improve scheduling and orchestration, supporting version control, CI/CD, and automation, and integrating ecosystem tools to bring data governance and data quality under control.

🔧 Data orchestration tools need to support a wide range of connectors, such as major databases and REST APIs, to help teams work with data efficiently, and they embrace Modern Data Stack concepts such as data assets and microservice architectures to cope with an increasingly complex data landscape.

Delving deeper, after selecting technology from the myriad available, you’ll inevitably confront the need to manage intermediate levels. This is particularly true when handling unstructured data, which necessitates transformation into a structured format. Orchestrators and cloud-computing frameworks play a crucial role in this process, ensuring efficient data manipulation across different systems and formats. In the following chapters, I’ll elucidate their role in completing the full architectural picture.

At their core, orchestrators:

- Trigger computations at the appropriate moments
- Model dependencies between computations
- Track the execution history of computations

Orchestrators excel in:

- Timing of events
- Identifying and addressing errors
- Restoring correct states
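To make these responsibilities concrete, here is a minimal, hypothetical sketch in the classic task-centric style, using Apache Airflow's Python API (Airflow 2.x; the DAG and the `extract`/`load` callables are invented for illustration):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Illustrative callables; a real pipeline would move actual data.
def extract():
    print("pulling rows from the source")

def load():
    print("writing rows to the warehouse")

with DAG(
    dag_id="daily_sales",             # execution history is tracked per DAG run
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # trigger computations at the appropriate moment
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},  # error handling
):
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task         # model dependencies between computations
```

Scheduling, retries, and the dependency graph live in the orchestrator, so the business code stays plain Python.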

While traditional orchestrators are task-centric, newer ones like Dagster emphasize Data Assets and Software-Defined Assets. This approach enhances scheduling and orchestration, as discussed in Dagster. These advancements align with Modern Data Stack concepts.

# What is an Orchestrator

An orchestrator is a tool for scheduling and monitoring workflows. For different technologies and file formats to work together, you need an orchestrator and a processing engine that prepares, moves, and wrangles the data correctly and delivers it to the right place.

Essentially, it takes siloed data from multiple data storage locations, combines it, and unifies it within the orchestrator.

# What Language does a Data Orchestrator Speak?

See What language does an Orchestrator speak.

# The Role of Orchestration in Mastering Complexity

Explore the key features in RW Building Better Analytics Pipelines.

Abstractions let you use data pipelines as microservices on steroids. Why? Because microservices are excellent at scaling but not as good at staying aligned across different services. A modern data orchestrator handles everything around these reusable abstractions. You can see each task or microservice as a single pipeline with its sole purpose, everything defined in a functional data engineering way. With an orchestrator, you do not need to start from zero when you begin a new microservice or pipeline.
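As a rough sketch of such a reusable abstraction, here is how Dagster's software-defined assets express each single-purpose step as a function, with dependencies inferred from parameter names (the asset names and data are made up for illustration):

```python
from dagster import Definitions, asset, materialize

@asset
def raw_orders():
    # Single-purpose step: fetch the raw data.
    return [{"id": 1, "amount": 42.0}, {"id": 2, "amount": -1.0}]

@asset
def cleaned_orders(raw_orders):
    # Naming the parameter `raw_orders` declares the dependency
    # and hands over the upstream data, functional-style.
    return [order for order in raw_orders if order["amount"] > 0]

defs = Definitions(assets=[raw_orders, cleaned_orders])

if __name__ == "__main__":
    materialize([raw_orders, cleaned_orders])  # run the asset graph locally
```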

More on Data Orchestration Trends: The Shift From Data Pipelines to Data Products.

Simple Version


Image from dlthub.com.

Selecting a technology should be followed by choosing an Orchestrator. This crucial step often goes overlooked.

For more insights, read Data Orchestration Trends: The Shift From Data Pipelines to Data Products.

# History: The Data Orchestration Evolution

To better understand the why, let's go back in time. In 2022 I wrote about the shift and trends in data orchestration and how we moved from pure task scheduling to more data-aware orchestration. To understand the space, we need to understand the history, as it went through different stages.

```mermaid
gantt
    title OSS Data Orchestration Evolution
    dateFormat  YYYY
    axisFormat  %Y
    todayMarker off

    section Command Line Era
    Cron :milestone, m1, 1987, 0d

    section ETL Tools Era
    Informatica PowerCenter :milestone, m4, 1993, 0d
    Oracle OWB :milestone, m2, 2000, 0d
    SQL Server Integration Services (SSIS) :milestone, m3, 2005, 0d

    section Python Orchestrators
    Apache Oozie :milestone, m5, 2011, 0d
    Luigi (Spotify) :milestone, m6, 2012, 0d
    Apache Airflow (Airbnb) :milestone, m7, 2014, 0d

    section Modern Orchestrators
    Dagster :milestone, m10, 2018, 0d
    Prefect :milestone, m8, 2019, 0d
    Kedro :milestone, m9, 2019, 0d
    Temporal :milestone, m11, 2019, 0d

    section Universal Orchestrators
    Kestra :milestone, m12, 2022, 0d
    Mage :milestone, m13, 2022, 0d
```

This chart shows the different stages of orchestration:

- In 1987, it started with the command-line scheduler tool cron (Vixie)
- Around 2000 came graphical, drag-and-drop ETL tools such as Oracle OWB, SQL Server Integration Services, and Informatica
- Then Python and big data orchestrators that schedule tasks: Apache Oozie (2011), Luigi (2012), and Apache Airflow (2014)
- Then modern orchestrators around 2019, such as Prefect, Kedro, Dagster, or Temporal
- And finally more approachable enterprise and universal orchestrators such as Kestra and Mage (2022)

Today we talk more about data assets: specific data tables, BI dashboards, or an S3 bucket. People don't think about the ETL in between and its transformations. Frankly, business people don't care about our well-crafted data pipelines.

But as a matter of fact, that is where the heavy lifting takes place and where our data assets get created. Besides more focus on data-aware orchestration, it's also important to get software engineering best practices in place for our central orchestration tool: versioning our pipelines in version control so we can go back in case of a faulty new version, and CI/CD to detect bugs early in the lifecycle. Most importantly, to be able to do all of this, we need a declarative data orchestrator, an orchestrator whose pipelines can be defined declaratively. How does that work?

Data-aware orchestration knows more about the data it runs on: it can reuse existing technical implementations across different tasks, or pass data along to the next task, instead of treating it as if we know nothing about it.

And declarative means we can configure and specify what we want to orchestrate while isolating the how (the technical implementation). A declarative approach also lets you quickly update the pipeline through configuration, version it, roll back in case of error, test, automate, and apply Software Engineering Best Practices to data pipelines.
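To illustrate the idea, here is a minimal, tool-agnostic Python sketch, with the spec format and the tiny runner invented for illustration: the what is plain data you can diff, review in a pull request, and roll back in Git, while the how is isolated behind named implementations:

```python
# The *what*: a declarative pipeline spec. Plain data, so it is easy to
# version, review, and roll back on a faulty release.
PIPELINE_SPEC = {
    "name": "daily_sales",
    "schedule": "@daily",
    "steps": [
        {"uses": "extract_api", "with": {"endpoint": "/orders"}},
        {"uses": "load_warehouse", "with": {"table": "raw_orders"}},
    ],
}

# The *how*: technical implementations, hidden behind names.
STEP_IMPLEMENTATIONS = {
    "extract_api": lambda endpoint: print(f"GET {endpoint}"),
    "load_warehouse": lambda table: print(f"COPY INTO {table}"),
}

def run(spec):
    """Tiny interpreter: walk the spec and dispatch to implementations."""
    for step in spec["steps"]:
        STEP_IMPLEMENTATIONS[step["uses"]](**step["with"])

run(PIPELINE_SPEC)
```

Because the spec is just configuration, updating, testing, and automating the pipeline becomes a change to data rather than to orchestration code.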

Besides data awareness and the declarative approach, the shift has gone further toward simpler tools that help consolidate the ecosystem. These integrate the strengths of different tools and orchestrate them to get data governance and data quality under control. Especially if you have source connectors to major databases and REST APIs out of the box, this will help any team tremendously.
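As a minimal sketch of such out-of-the-box connectivity, here is dlt (the dlthub tool pictured earlier) loading rows into DuckDB; this assumes `dlt` with the duckdb extra is installed, and uses an in-memory list where a real run would use one of dlt's database or REST API source connectors:

```python
import dlt

# Stand-in rows; a real pipeline would pull from one of dlt's
# out-of-the-box source connectors for databases or REST APIs.
rows = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]

pipeline = dlt.pipeline(
    pipeline_name="orders_demo",
    destination="duckdb",   # assumes the duckdb extra is installed
    dataset_name="raw",
)

load_info = pipeline.run(rows, table_name="orders")
print(load_info)  # summary of what was loaded where
```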

The full list of open-source data orchestrators
If you are curious and want to see the complete list of tools and frameworks, I suggest you check out the Awesome Pipeline List on GitHub.

# An older version

Orchestrators have evolved from simple task managers to complex systems integrating with the Modern Data Stack. Let’s trace their journey:

- 1987: The inception with (Vixie) cron
- 2000: The emergence of graphical ETL tools like Oracle Warehouse Builder (OWB), SSIS, and Informatica
- 2011: The rise of Hadoop orchestrators like Luigi, Oozie, and Azkaban
- 2014: The rise of simple orchestrators like Airflow
- 2019: The advent of modern Python orchestrators like Prefect, Kedro, Dagster, and Temporal, or even the fully SQL framework dbt
- Then declarative pipelines, fully managed in Ascend.io, Palantir Foundry, and other data lake solutions

For an exhaustive list, visit the Awesome Pipeline List on GitHub. More on the history in Bash-Script vs. Stored Procedure vs. Traditional ETL Tools vs. Python-Script - 📖 Data Engineering Design Patterns (DEDP).

Image: GitHub star history (star-history-202434.png). Also check GitHub Star History, even though the stars don't tell you much.

# Another version

A nice illustration by dlt in On Orchestrators: You Are All Right, But You Are All Wrong (image: Pasted image 20240601172559.png).

# Control Plane

As of 2024-07-09:
Data orchestrators are the control plane that keeps the heterogeneous data stack together. I like Dagster; even though it's harder to start with, it gently "forces" you :) to use good practices. For example, technical code can be separated into resources (typically maintained by data engineers) that everyone can reuse, while business logic can be written by domain experts, nowadays directly next to the data assets. These are declarative and can easily be automated and versioned. Also, everything can run locally with a mocked Spark cluster exactly as it runs in production with Databricks, without changing a single line of DAG code; the only thing needed is to define run configs for each environment.
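A sketch of that separation in Dagster, with resource and asset names invented for illustration and the real Spark/Databricks wiring elided: the technical plumbing lives in a resource, the business logic in the asset, and only the resource configuration changes per environment:

```python
import os

from dagster import ConfigurableResource, Definitions, asset

class QueryEngine(ConfigurableResource):
    """Technical code, typically maintained by data engineers."""
    target: str  # e.g. "local" or "databricks"

    def run_query(self, sql: str):
        # Stand-in for a real Spark/Databricks call.
        print(f"[{self.target}] {sql}")

@asset
def monthly_revenue(engine: QueryEngine):
    """Business logic, written by domain experts close to the data asset."""
    return engine.run_query("SELECT month, SUM(amount) FROM orders GROUP BY month")

# Same asset code everywhere; only the resource config differs per environment.
env = os.getenv("DAGSTER_ENV", "local")
defs = Definitions(
    assets=[monthly_revenue],
    resources={"engine": QueryEngine(target="databricks" if env == "prod" else "local")},
)
```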

# Different Types of Orchestration

As of 2022-09-21:

- Airflow when you need task scheduling only (no data awareness)
- Dagster when you foresee higher-level data engineering problems. Dagster has more abstractions, as they grew from first principles with a holistic view in mind from the very beginning. They focus heavily on data integrity, testing, idempotency, data assets, etc.
- Prefect if you need fast and dynamic modern orchestration with a straightforward way to scale out. They recently revamped the Prefect core as Prefect 2.0 with a new second-generation orchestration engine called Orion. It has several abstractions that make it a Swiss army knife for general task management.
    - With the new Orion engine built into Prefect 2.0, they're very similar to Temporal and support fast, low-latency application orchestration
- Others:
    - Kubernetes-native orchestration
    - YAML orchestration: Kestra, etc.

Or, as others put it in this Tweet, I'd use:

- Airflow for plain task scheduling
- Prefect for fast, low-latency imperative workflow scheduling
- Dagster for data-aware pipelines when you want best-in-class, but opinionated, support

Also, explore insights from the podcast Re-Bundling The Data Stack With Data Orchestration And Software Defined Assets Using Dagster | Data Engineering Podcast with Nick Schrock.

# Asset Oriented Orchestration

2025-04-23: Airflow announced version 3.0, which introduces data assets.

I haven't tested the newest Airflow asset features, but I don't believe you can simply switch from being task-oriented to asset-oriented; this is a much deeper shift that is hard to retrofit into Airflow. But I need to dig a little deeper myself. Meanwhile, I'd still argue that Dagster is miles ahead, with its code locations that support data mesh, resources that separate business logic from technical implementation, and the sheer ease of getting started with it. On top of that, there are also newer tools like Kestra that cater more to non-developer users while still being code-based.

But I'm happy to see the market evolve and orchestrators converge on this approach. LinkedIn

# Comparing Dagster with Vim

Dagster is vim for orchestration. It has a steeper learning curve; you need to learn its concepts. Initially, it’s harder, but with complex/heterogeneous data infrastructure, these concepts can save you time and money.

Take vim motions: they are hard to learn but worth every minute if you write/code all day. The same goes for orchestration: if data and managing complexity are core to your business, it's worth having a robust, battle-tested architecture in place, and you get that out of the box with Dagster. Tweet.

Fun is another parallel with vim. Vim, to me, is more fun to use than VS Code (see PDE). Dagster is likewise more fun for data engineers, as it focuses heavily on data engineering and developer productivity.

# Comparing Prefect, Dagster, Airflow, Mage

Image from Decoding Data Orchestration Tools: Comparing Prefect, Dagster, Airflow, Mage (img_Data Orchestrators_1749022578914.webp).

# What language does an Orchestrator speak?

What language does an Orchestrator speak

# Further Reads

Here are two deep dives of mine about this very topic:


Origin:
References: Python, What is an Orchestrator, Why you need an Orchestrator, Apache Airflow
Created:
