Second Brain: Crafted, Curated, Connected, Compounded on October 2
The Rise and Significance of the Declarative Data Stack

A declarative data stack is a solution that integrates multiple tools and their configuration so that the entire stack can be recreated as a single function, such as run_stack(serve(transform(ingest))). It emphasizes managing the end-to-end data engineering lifecycle, from ingestion to visualization, and follows the principles of functional data engineering: reproducibility, few side effects, and idempotency. It aims to overcome the limitations of traditional single-vendor solutions while combining the strengths of open-source tools, offering the flexibility and ease of use required for enterprise applications. To do so, it integrates best-of-breed open-source technologies such as dbt, Airflow, Superset, and JupyterHub into a unified data platform with multi-tenancy and full authentication and authorization, managed through a single web portal.

🔧 The declarative approach stresses a single, reproducible configuration: the entire stack can be recreated from declarative config files, which improves the efficiency and reliability of data engineering and keeps data processing correct and consistent.

🔒 By consolidating tools and configuration into one platform with multi-tenancy, authentication, and authorization behind a single web portal, it aims to give enterprises a secure and reliable foundation for managing and protecting their data.

Declarative Data Stack is a term introduced in the article The Rise of the Declarative Data Stack by Mike Driscoll and me.

# What Is a Declarative Data Stack?

A declarative data stack is a set of tools and, more precisely, their configs; taken together, they can be thought of as a single function such as run_stack(serve(transform(ingest))) that can recreate the entire data stack.
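A minimal sketch of that idea in Python, with hypothetical ingest, transform, serve, and run_stack functions that read nothing but a single declarative config (the function names mirror the formula above; everything else is an assumption, not a real API):

```python
# Hypothetical sketch only: these functions are not from any real tool.
# They illustrate how a single declarative config can recreate the whole stack.

CONFIG = {
    "ingest": {"source": "s3://raw/orders/", "target": "orders_raw"},
    "transform": {"model": "orders_cleaned"},
    "serve": {"dashboard": "orders_overview"},
}

def ingest(config: dict) -> str:
    # Load raw data exactly as declared; return the produced table name.
    print(f"ingesting {config['ingest']['source']} -> {config['ingest']['target']}")
    return config["ingest"]["target"]

def transform(raw_table: str, config: dict) -> str:
    # Build the declared model on top of whatever ingest produced.
    print(f"building model {config['transform']['model']} from {raw_table}")
    return config["transform"]["model"]

def serve(model: str, config: dict) -> str:
    # Point the declared dashboard at the transformed model.
    print(f"publishing dashboard {config['serve']['dashboard']} on {model}")
    return config["serve"]["dashboard"]

def run_stack(artifact: str) -> str:
    # Mirrors the article's formula: run_stack wraps the composed pipeline.
    print(f"stack is up, serving {artifact}")
    return artifact

run_stack(serve(transform(ingest(CONFIG), CONFIG), CONFIG))
```

The point is not the helper functions themselves but that nothing outside CONFIG is needed to bring the stack back to the same state.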

Instead of having one framework for each piece, we want multiple tools combined into a single declarative data stack. Think of the Modern Data Stack, but integrated the way Kubernetes integrates all infrastructure into a single deployment described in YAML.

We focus on the end-to-end Data Engineering Lifecycle, from ingestion to visualization. But what does the combination with declarative mean? Think of Functional Data Engineering, which leaves us with confident reproducibility and few side effects (hopefully none), and uses idempotency to restart functions and recover or reinstate a particular state with conviction, or roll back to a specific version.
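To make the functional data engineering point concrete, here is a small hypothetical example of an idempotent transform: it rebuilds one partition deterministically from its inputs, so re-running it after a failure converges to the same state instead of duplicating rows (the table and column names are made up for illustration):

```python
import sqlite3

def rebuild_daily_partition(conn: sqlite3.Connection, day: str) -> None:
    """Idempotent transform: delete and rewrite one partition from its source.

    Running this twice for the same `day` yields exactly the same table state,
    which is what lets a declarative stack restart or roll back with confidence.
    """
    with conn:  # one transaction: the partition is either fully rebuilt or untouched
        conn.execute("DELETE FROM orders_clean WHERE order_day = ?", (day,))
        conn.execute(
            """
            INSERT INTO orders_clean (order_id, order_day, amount)
            SELECT order_id, order_day, amount
            FROM orders_raw
            WHERE order_day = ? AND status IS NOT NULL
            """,
            (day,),
        )

# Purely illustrative in-memory setup.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders_raw (order_id INT, order_day TEXT, amount REAL, status TEXT);
    CREATE TABLE orders_clean (order_id INT, order_day TEXT, amount REAL);
    INSERT INTO orders_raw VALUES (1, '2024-10-17', 9.5, 'ok'), (2, '2024-10-17', 3.0, NULL);
    """
)
rebuild_daily_partition(conn, "2024-10-17")
rebuild_daily_partition(conn, "2024-10-17")  # re-run: still exactly one clean row
print(conn.execute("SELECT COUNT(*) FROM orders_clean").fetchone()[0])  # -> 1
```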

More on The Rise of the Declarative Data Stack.

Other Naming

Tobiko calls it an Integrated Data Stack.

Dagster talks about the Impedance Mismatch and data-asset-oriented orchestration: The Data Engineering Impedance Mismatch | Dagster Blog

# Why Do You Need a Declarative Data Stack?

Why do you need a declarative data stack or an Open Enterprise Data Platform, you might ask?

Data grows more each year than in the entire lifetime before. There is a growing need to make sense of more data. In the old days, you had a single-vendor solution: think of SAP or Oracle. These days, new SaaS and open-source products are created daily, each specializing in a tiny niche. So why would you need another platform?

You want the best of both worlds. You want open source so you are not locked in and can use the strongest, collaboratively created tools built in the open. People worldwide can fix a security bug in minutes, or you can fix it yourself, compared to a large vendor where you rely solely on their update cycle.

The downside of open source is that there are a lot of bugs, missing features, and independent tools. That’s precisely where HelloDATA BE comes into play. We are building the missing platform that combines best-of-breed open-source technologies into a single portal, making it enterprise-ready.

# How to Balance Flexibility with Simplicity

Flexibility is one disadvantage of a DDS, as you may need to implement the interface first before new functions are available in your configs. Therefore, if you require full flexibility, an imperative approach is preferable.

I’d say start somewhere, try it out, get a feel for it, and adjust. If you know you’re going to throw it away, there’s no need for a DDS. However, if you know you are at the enterprise level, you most probably want the simplicity and also the restrictive nature of a DDS. (This came up as a question on LinkedIn.)
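One way to picture that trade-off: a declarative config can only express what the engine’s interface already understands, so a new capability means extending the engine first. A minimal sketch, assuming a hypothetical step registry (none of these names come from a real tool):

```python
# Hypothetical engine: configs may only declare steps the engine has registered.
STEP_REGISTRY = {}

def register_step(name):
    def wrap(fn):
        STEP_REGISTRY[name] = fn
        return fn
    return wrap

@register_step("ingest")
def ingest_step(params):
    print(f"ingesting from {params['source']}")

@register_step("transform")
def transform_step(params):
    print(f"running model {params['model']}")

def run(config):
    for step in config["steps"]:
        fn = STEP_REGISTRY.get(step["use"])
        if fn is None:
            # Full flexibility would require imperative code; here the interface
            # must be extended before the config can declare a new step type.
            raise ValueError(f"unknown step {step['use']!r}: extend the engine first")
        fn(step.get("with", {}))

run({"steps": [
    {"use": "ingest", "with": {"source": "s3://raw/orders"}},
    {"use": "transform", "with": {"model": "orders_cleaned"}},
]})
```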

# Closed-Source DDSEs

Let’s start with closed source first. One key point to note: most of what we’ve discussed here is something that most closed-source tools have implemented in one way or another. Because they’ve built one big monolith, this is relatively straightforward and the natural thing to do.

This can be more challenging and not immediately obvious with an open-source approach and numerous integration tools. Let’s now look at tools that have successfully implemented such features.

- Ascend: The platform automates up to 90% of repetitive data tasks using their DataAware Automation Engine.
- Palantir Foundry: One of the first lakehouse implementations before the term was coined. Enables real-time collaboration between data, analytics, and operational teams through a common logical data lake layer.
- Y42: Mission Control for Your Data Pipelines.

Find more on Closed-Source Data Platforms, and a fantastic read on composable data stacks as a new frontier by Voltron Data.

Usually, the problem with closed-source software is that it is structured as a monolith, combining transformation logic with persisted database tables while keeping the underlying code unknown.

# Open-Source DDSEs

But even more interesting are the open-source tools I found[^1]. They are fantastic and built in the open. Not all of them might be truly declarative data stacks by this definition, but they all build on top of other tools and integrate them declaratively.

- DataForge: Write functional transformation pipelines by leveraging software engineering principles. It focuses on transformation and does not have a visualization tool[^3].
- Starlake: Declarative, text-based tool for data analysts and engineers to extract, load, transform, and orchestrate their data pipelines.
- Dashtool: A lakehouse build tool that builds Iceberg tables from declarative SQL statements and generates Kubernetes workflows to keep these tables up to date. It handles ingestion, transformation, and orchestration. Written in Rust and uses DataFusion.
- BoilingData: A local-first data processing native application designed for rapid data pipeline development. Enables data engineers to build and test pipelines quickly using tools like DuckDB, dbt, and dlt.
- HelloDATA BE: An enterprise data platform built on open-source tools based on the modern data stack. It uses state-of-the-art tools such as dbt for data modeling with SQL, Airflow to run and orchestrate tasks, Superset to visualize BI dashboards, and JupyterHub for data science tasks. It includes multi-tenancy, full authentication, and authorization, all handled through a single web portal.
- SDF: Similar to DataForge, built on Rust and DataFusion. Tries to be the TypeScript for SQL, creating faster development cycles and reliable results with a powerful compiler.
- SQLMesh: An efficient data transformation and modeling framework with compiler capabilities built in via SQLGlot, a Python SQL parser and transpiler.
- GitHub Actions: The simplest version for building a declarative data stack. A deploy.yaml script could be a simple DDS config. GitHub also has an engine that converts it and runs it on Docker runners. So, in a way, it’s another engine implementation, and maybe we could take some configs based on that? (See the sketch after this list.)
- Datacoves: The platform helps enterprises solve data analytics challenges with managed dbt, Airflow, and VS Code, adopting best practices. This approach avoids negotiating multiple SaaS contracts and reduces consulting costs without compromising data security.
- Datadex: Serverless and local-first open data platform.
- BigFunctions: A framework to build a governed catalog of powerful BigQuery functions with a SQL-first approach. Ingesting, advanced data transforms, and serving data in a data app with a single SQL query.
- Hoptimator
- Bruin
- Iasql: Cloud infrastructure as data in PostgreSQL.
- SQL Declarative-Data-Stack
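To give a feel for how thin such an “engine” can be, here is a hedged sketch that treats a deploy.yaml-style file as the entire stack definition and simply runs one command per job. The YAML layout and the commands are invented for illustration (only PyYAML’s yaml.safe_load is a real API), and the echo prefixes keep it runnable without any of the tools installed:

```python
import subprocess
import yaml  # PyYAML

# Invented deploy.yaml-style config: each job is one stage of the stack.
DEPLOY_YAML = """
jobs:
  ingest:
    run: "echo ingest raw orders"
  transform:
    run: "echo dbt build --select orders_cleaned"
  serve:
    run: "echo publish superset dashboard orders_overview"
"""

def run_stack_from_yaml(text: str) -> None:
    # The "engine" is nothing more than: parse the declaration, run each job.
    config = yaml.safe_load(text)
    for name, job in config["jobs"].items():
        print(f"--- {name} ---")
        subprocess.run(job["run"], shell=True, check=True)

run_stack_from_yaml(DEPLOY_YAML)
```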

SQL Compilers

SQLGlot would be a good integration for parsing SQL without running it, similar to SDF’s integration with DataFusion.
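For instance, SQLGlot can parse and transpile a query without executing it, which is exactly the kind of compile-time check a declarative stack could run over its configs (a small example against SQLGlot’s public API; the query itself is made up):

```python
import sqlglot

query = "SELECT order_id, SUM(amount) AS total FROM orders GROUP BY order_id"

# Parse without running: surfaces syntax errors at "compile time".
expression = sqlglot.parse_one(query, read="duckdb")
print(expression.sql(dialect="duckdb", pretty=True))

# Transpile the same declared SQL to another engine's dialect.
print(sqlglot.transpile(query, read="duckdb", write="spark")[0])
```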

Existing Templating
Beyond dedicated tools, templating can solve part of this: Jinja templates, Go’s template package, biGENIUS Template modules, Apache Velocity, Liquid, and many others.
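As a small illustration of the templating route, here is a Jinja2 snippet that renders a parameterized SQL model from declarative values, similar in spirit to how dbt templates its models (the model and variables are invented):

```python
from jinja2 import Template

sql_template = Template(
    """
    SELECT {{ columns | join(', ') }}
    FROM {{ source_table }}
    WHERE order_day >= '{{ start_date }}'
    """
)

print(sql_template.render(
    columns=["order_id", "amount"],
    source_table="lake.orders_raw",
    start_date="2024-10-01",
))
```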

# Alternatives

# What’s the Difference from a Composable Data System?

Composable Data Stacks or Systems sound very similar, as do Multi-Engine Data Stacks. Are they all the same thing, just with different wording?

# Testing

We need deterministic simulation testing for a DDS, like libSQL does for its SQLite rewrite.
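A hedged sketch of what deterministic simulation testing could look like for a DDS: drive a toy pipeline through a seeded schedule of injected failures and assert that the same seed always produces an identical end state (everything here is invented for illustration):

```python
import random

def simulate_stack(seed: int) -> dict:
    """Run a toy ingest -> transform loop under a deterministic failure schedule."""
    rng = random.Random(seed)
    state = {"raw": 0, "clean": 0}
    for day in range(10):
        attempts = 0
        while True:
            attempts += 1
            # Injected, but reproducible, failure: crash before commit and retry.
            if rng.random() < 0.3 and attempts < 3:
                continue
            state["raw"] += 1
            state["clean"] = state["raw"]  # transform rebuilt from raw, not appended
            break
    return state

# Same seed -> identical end state, so any regression is replayable exactly.
assert simulate_stack(seed=42) == simulate_stack(seed=42)
print(simulate_stack(seed=42))
```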

# Declarative Data Stack ENGINE

The engine is an important part, which I go into in more detail in Designing a Declarative Data Stack: From Theory to Practice | ssp.sh. It is similar to how Markdown can be the code, while HackMD, GDocs, and others are engines that run it.

Key distinctions: part-3-example-implementation-declarative-data-stack

Docker and Kubernetes are other engines. There are many more; see part-3-example-implementation-declarative-data-stack.

# Further Reading


Origin: The Rise of the Declarative Data Stack - Rill
Created 2024-10-17
