Apache Airflow: Workflow Scheduling and Monitoring

 

Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch workflows. Workflows are defined as Python code, which supports dynamic generation, extensibility, and parameterization, and a web interface is provided for management. Through rich scheduling and execution semantics, Airflow supports defining complex pipelines, backfilling, and partially rerunning work after errors are resolved, improving efficiency. Although Airflow is primarily oriented around tasks and workflows rather than data assets, its "workflows as code" philosophy makes it a powerful tool for DevOps and data teams managing batch jobs. Recently, Airflow has also begun to support declarative approaches such as the Airflow DAG Factory Pattern.

⭐ At the core of Airflow is its "workflows as code" philosophy: all workflows are defined in Python. This brings dynamic generation, extensibility, and flexible parameterization via the Jinja templating engine. Workflows can be version-controlled, developed collaboratively, and covered by tests, which improves development efficiency and code quality.

🚀 Airflow excels at scheduling and monitoring batch workflows, and its operators (such as PythonOperator, DockerOperator, and KubernetesPodOperator) connect to a wide range of technology stacks. Its rich scheduling features, including timed execution, backfilling historical data, and rerunning parts of a pipeline after an error has been resolved, greatly improve the efficiency and reliability of workflow execution.

⚠️ Although Airflow is powerful, it orchestrates around jobs rather than data assets, which can add complexity to data processing. Compared with asset-oriented orchestration tools such as Dagster, Airflow's imperative (rather than declarative) approach may not fit every scenario. However, Airflow has started to introduce declarative approaches such as the DAG Factory Pattern.

💡 In practice, watch out for common mistakes, such as coupling the DAG folder to the main Airflow infrastructure repository (so that changing a DAG requires restarting Airflow), or misconfiguring the log folder in a way that causes performance problems. Ensuring that DAGs support backfilling and are prepared for scale is key to long-term maintenance and efficient operation.

As outlined in Orchestrators, the choice of how to schedule and monitor workflows is pivotal. While there are numerous orchestrators, Airflow is the most prevalent and widely adopted.

Traditionally, ETL tools such as Microsoft SQL Server Integration Services (SSIS) dominated the scene, serving as hubs for data transformation and cleaning, as well as Normalization processes.

However, contemporary architectures demand more. The value of code and data transformation logic now extends beyond their immediate functional use, proving essential to other data-informed individuals within an organization.

I highly recommend delving into Maxime Beauchemin’s piece on Functional Data Engineering — a modern paradigm for batch data processing for a deeper understanding of modern data pipelines.

# What is Apache Airflow?

Apache Airflow™ is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows.

Airflow’s extensible Python framework enables you to build workflows connecting with virtually any technology. A web interface helps manage the state of your workflows. Airflow is deployable in many ways, varying from a single process on your laptop to a distributed setup to support even the biggest workflows.

Apache Airflow Core includes the webserver, scheduler, CLI, and other components needed for a minimal Airflow installation; see the Airflow documentation for details.

# Paradigms: Workflows as code

The main characteristic of Airflow workflows is that all workflows are defined in Python code. “Workflows as code” serves several purposes:

- Dynamic: Airflow pipelines are configured as Python code, allowing for dynamic pipeline generation.
- Extensible: The Airflow framework contains operators to connect with numerous technologies. All Airflow components are extensible to easily adjust to your environment.
- Flexible: Workflow parameterization is built-in, leveraging the Jinja templating engine.
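
As a small illustration of these three properties, DAG definitions can be generated from ordinary Python. The DAG id, schedule, and source names below are made up for this sketch:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical list of source systems; one extract task is generated per entry.
SOURCES = ["crm", "billing", "web"]

with DAG(
    dag_id="extract_sources",      # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    for source in SOURCES:
        # Dynamic: tasks are created in a loop.
        # Flexible: {{ ds }} is a Jinja template resolved to the run's logical date.
        BashOperator(
            task_id=f"extract_{source}",
            bash_command=f"echo 'extracting {source} for {{{{ ds }}}}'",
        )
```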

# Why Airflow?

Airflow is a batch workflow orchestration platform. Unlike alternatives such as Dagster and Kestra, which focus on data-aware orchestration, Airflow is mainly about managing tasks and workflows.

The Airflow framework contains operators to connect with many technologies and is easily extensible to connect with a new technology. If your workflows have a clear start and end, and run at regular intervals, they can be programmed as an Airflow DAG.

If you prefer coding over clicking, Airflow is the tool for you. Workflows are defined as Python code which means:

- Workflows can be stored in version control so that you can roll back to previous versions
- Workflows can be developed by multiple people simultaneously
- Tests can be written to validate functionality
- Components are extensible, and you can build on a wide collection of existing components

Rich scheduling and execution semantics enable you to easily define complex pipelines, running at regular intervals. Backfilling allows you to (re-)run pipelines on historical data after making changes to your logic. And the ability to rerun partial pipelines after resolving an error helps maximize efficiency.
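
As a hedged sketch of what this looks like in practice (the DAG id and dates are placeholders): an explicit start_date together with catchup=True lets the scheduler create runs for every past interval, and the backfill CLI can re-run a date range after a logic change:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_ingest",           # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,                    # scheduler creates runs for missed past intervals
) as dag:
    BashOperator(
        task_id="ingest",
        # {{ ds }} scopes each run to its own logical date, which keeps reruns safe
        bash_command="echo 'ingesting data for {{ ds }}'",
    )

# After changing the logic, re-run a historical range from the CLI, for example:
#   airflow dags backfill daily_ingest -s 2024-01-01 -e 2024-01-31
```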

Read more in the Airflow Documentation.

# Why not Airflow?

Because the data we work with are persisted Data Assets, and the one component of the data stack that does not work with assets is Airflow. It orients around jobs. In other words, it is imperative instead of declarative.


From the Data Council talk by Dagster | The Data Engineering Impedance Mismatch | Dagster Blog

A better way of working with workflows is to keep data assets in mind. Check out Dagster for more.

# Update on Declarative

Since 2024-08-13, Airflow also supports a declarative approach with the Airflow DAG Factory Pattern.
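
A rough sketch of that pattern with the dag-factory library, assuming its documented usage; the YAML path below is a placeholder, and the actual DAG structure would be declared in that YAML file:

```python
# Sketch of the declarative DAG Factory pattern using the dag-factory package
# (usage as documented upstream; the config path below is a placeholder).
import dagfactory

config_file = "/opt/airflow/dags/config_file.yml"  # hypothetical YAML with the DAG definition

dag_factory = dagfactory.DagFactory(config_file)

# Register the generated DAGs in this module's globals so the scheduler discovers them.
dag_factory.clean_dags(globals())
dag_factory.generate_dags(globals())
```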

# We are all using Airflow wrong

# Airflow Commands

To obtain the currently configured executor (supported executors: LocalExecutor, CeleryExecutor, KubernetesExecutor, LocalKubernetesExecutor, CeleryKubernetesExecutor):

```sh
airflow config get-value core executor
```

# Airflow Parameterization

See the load_type parameter when running test.py. Note that we use a Jinja template here with {{ params.load_type }}:

```python
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# workspace_name and common_k8s_pod_operator_params are defined elsewhere in the project
with DAG(
    dag_id="sleep_hosp",
    schedule="@once",
    description="developing and testing",
    tags=[workspace_name],
    params={
        "load_type": "init"
    },
) as dag:
    sleep = KubernetesPodOperator(
        **common_k8s_pod_operator_params,
        name="sleep_hosp",
        task_id="sleep_hosp",
        arguments=[
            "python dags-logic/test.py --setup={{ params.load_type }} && echo 'setup successfully'"
        ],
    )
```

In test.py we can then work with the parameter:

```python
import argparse


def setup_lzn_tables(param):
    print(f"Print params: {param}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Run specific functions in the script."
    )
    parser.add_argument("--setup", required=True)
    args = parser.parse_args()

    if args.setup:
        setup_lzn_tables(args.setup)
```

Warning

The window for entering the parameters only appears after you click on Trigger DAG.

More on Params — Airflow Documentation.

# Template Reference

From Templates reference:

| Variable | Type | Description |
| --- | --- | --- |
| {{ params }} | dict[str, Any] | The user-defined params. This can be overridden by the mapping passed to `trigger_dag -c` if `dag_run_conf_overrides_params` is enabled in `airflow.cfg`. |
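
For illustration (a minimal sketch; the DAG id and command are made up), {{ params }} resolves inside any templated field, such as BashOperator's bash_command, and the default can be overridden at trigger time:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="templated_params_demo",    # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    params={"load_type": "init"},      # default value, overridable per run
) as dag:
    BashOperator(
        task_id="echo_load_type",
        # bash_command is a templated field, so Jinja resolves params here
        bash_command="echo 'load_type={{ params.load_type }}'",
    )

# Overriding the default at trigger time, for example:
#   airflow dags trigger templated_params_demo --conf '{"load_type": "full"}'
```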

# Airflow Operators

# PythonOperator

The Airflow PythonOperator is optimal when the business logic and code are housed within the Airflow DAG directory. The PythonOperator facilitates the import and execution of these components.

```
airflow
    \__dags
        \_classification_workflow.py
        \_tweet_classification
            \_preprocess.py
            \_predict.py
            \___init__.py
    \__logs
    \__airflow.cfg
```
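
With this layout, classification_workflow.py might look roughly like the following; the preprocess_tweets and predict_tweets helpers are assumptions based on the module names above:

```python
# dags/classification_workflow.py -- a sketch; the helper functions are hypothetical
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from tweet_classification.preprocess import preprocess_tweets  # hypothetical helper
from tweet_classification.predict import predict_tweets        # hypothetical helper

with DAG(
    dag_id="classification_workflow",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    preprocess = PythonOperator(
        task_id="preprocess",
        python_callable=preprocess_tweets,
    )
    predict = PythonOperator(
        task_id="predict",
        python_callable=predict_tweets,
    )

    # Business logic is imported directly from the DAG directory
    preprocess >> predict
```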

# Pros

- Ideal when the code is in the same repository as Airflow.
- User-friendly and straightforward.
- Efficient for smaller teams.

# Cons

- Tightly couples Airflow code with business logic.
- Changes in business logic necessitate Airflow code redeployment.
- Sharing a single Airflow instance across multiple projects becomes challenging.
- Limited to Python code.

# DockerOperator

Caution Advised

The DockerOperator is becoming obsolete. It's recommended to opt for the KubernetesPodOperator instead. As highlighted in this StackOverflow discussion, "The real answer is to use the KubernetesPodOperator. DockerOperator will soon lose its functionality with the phasing out of dockershim."

The DockerOperator in Airflow manages business logic and code within a Docker image. Upon execution:

- Airflow fetches the designated image.
- Initiates a container.
- Executes the given command.
- Requires an active Docker daemon.

```python
from airflow.providers.docker.operators.docker import DockerOperator

DockerOperator(
    dag=dag,
    task_id='docker_task',
    image='gs://project-predict/predict-api:v1',
    auto_remove=True,
    docker_url='unix://var/run/docker.sock',
    command='python extract_from_api_or_something.py',
)
```

# Pros

- Effective for cross-functional teams.
- Compatible with non-Python projects.
- Ideal for Docker-centric infrastructures.

# Cons

- Requires Docker on the worker machine.
- High resource demand on the worker machine when running multiple containers.

# KubernetesPodOperator

The KubernetesPodOperator places business logic and code within a Docker image. During execution, Airflow initiates a worker pod, which then retrieves and executes commands from the specified Docker image.

```python
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

KubernetesPodOperator(
    task_id='classify_tweets',
    name='classify_tweets',
    cmds=['python', 'app/classify.py'],
    namespace='airflow',
    image='gcr.io/tweet_classifier/dev:0.0.1',
)
```

# Pros

- Facilitates collaboration across different functional teams.
- Enables sharing a single Airflow instance across various teams without complications.
- Decouples DAGs from business logic.

# Cons

- Presents complexity in infrastructure due to its reliance on Docker and Kubernetes.

# Common Mistakes

Here are frequent errors observed in DevOps and Data teams when implementing Airflow:

- DAG Folder Location: Often, the DAG folder is part of the main Airflow infrastructure repository. This necessitates a full Airflow restart for any DAG modification, potentially causing job failures and inconvenience. Ideally, your DAG folder should live separately, as configured in your airflow.cfg.
- Local Log Folder Configuration: A common oversight I encountered around 2015 or 2016 was configuring the log folder locally. In my case, this resulted in an EC2 instance crash due to log overload after six months. This issue persists in some setups today.
- Non-Backfillable DAGs: One of Airflow's advantages is its ability to rerun failed past DAGs without disrupting the current data. Ensuring your DAGs support easy maintenance and backfilling is crucial (see the sketch after this list).
- Lack of Scalability Preparation: Initially, as you deploy a few DAGs, you might not notice scalability issues. However, as the number of DAGs increases (20, 30, 100+), you'll observe longer wait times in the scheduling phase. While adjusting configurations in your airflow.cfg might help, involving DevOps for scalability solutions might become necessary.
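
One way to keep DAGs backfillable, sketched here under assumed names and a placeholder path: partition every write by the run's logical date, so that re-running a past interval only overwrites that interval's output.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="partitioned_load",       # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,
) as dag:
    BashOperator(
        task_id="load_partition",
        # Each run writes only to its own date partition (placeholder path),
        # so backfilling 2024-01-05 touches nothing but dt=2024-01-05.
        bash_command="echo 'writing to s3://my-bucket/events/dt={{ ds }}/'",
    )
```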

For a comprehensive discussion, refer to Mistakes I Have Seen When Data Teams Deploy Airflow.

Read more on GitHub - jghoman/awesome-apache-airflow: Curated list of resources about Apache Airflow.


References: Data Orchestrators, Dagster, Apache Airflow Wiki, Why using Airflow, Features of Airflow, OLAP
