Second Brain: Crafted, Curated, Connected, Compounded on October 2
The Rise and Significance of the Declarative Data Stack

A declarative data stack is a solution that integrates multiple tools and their configuration so that the entire stack can be recreated as a single function, such as run_stack(serve(transform(ingest))). It emphasizes managing the end-to-end data engineering lifecycle, from ingestion to visualization, and follows the principles of functional data engineering: reproducibility, few side effects, and idempotency. It aims to overcome the limitations of traditional single-vendor solutions while combining the strengths of open-source tools, offering the flexibility and ease of use required for enterprise applications. To do so, it integrates best-of-breed open-source technologies such as dbt, Airflow, Superset, and JupyterHub into a unified data platform with multi-tenancy and full authentication and authorization, managed through a single web portal.

🔧 The declarative approach stresses a single, reproducible configuration: the entire stack can be recreated from declarative config files, which improves the efficiency and reliability of data engineering and keeps data processing correct and consistent.

🔒 By consolidating tools and configuration into one platform with multi-tenancy, authentication, and authorization behind a single web portal, it aims to give enterprises a secure and reliable foundation for managing and protecting their data.

Declarative Data Stack is a term introduced in the article The Rise of the Declarative Data Stack by Mike Driscoll and me.

# What Is a Declarative Data Stack?

A declarative data stack is a set of tools and, more precisely, their configs; taken together, they can be thought of as a single function such as run_stack(serve(transform(ingest))) that can recreate the entire data stack.
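A minimal sketch of that idea in Python, with hypothetical ingest, transform, serve, and run_stack functions that read nothing but a single declarative config (the function names mirror the formula above; everything else is an assumption, not a real API):

```python
# Hypothetical sketch only: these functions are not from any real tool.
# They illustrate how a single declarative config can recreate the whole stack.

CONFIG = {
    "ingest": {"source": "s3://raw/orders/", "target": "orders_raw"},
    "transform": {"model": "orders_cleaned"},
    "serve": {"dashboard": "orders_overview"},
}

def ingest(config: dict) -> str:
    # Load raw data exactly as declared; return the produced table name.
    print(f"ingesting {config['ingest']['source']} -> {config['ingest']['target']}")
    return config["ingest"]["target"]

def transform(raw_table: str, config: dict) -> str:
    # Build the declared model on top of whatever ingest produced.
    print(f"building model {config['transform']['model']} from {raw_table}")
    return config["transform"]["model"]

def serve(model: str, config: dict) -> str:
    # Point the declared dashboard at the transformed model.
    print(f"publishing dashboard {config['serve']['dashboard']} on {model}")
    return config["serve"]["dashboard"]

def run_stack(artifact: str) -> str:
    # Mirrors the article's formula: run_stack wraps the composed pipeline.
    print(f"stack is up, serving {artifact}")
    return artifact

run_stack(serve(transform(ingest(CONFIG), CONFIG), CONFIG))
```

The point is not the helper functions themselves but that nothing outside CONFIG is needed to bring the stack back to the same state.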

Instead of having one framework for each piece, we want multiple tools combined into a single declarative data stack. Think of the Modern Data Stack, but integrated the way Kubernetes integrates all infrastructure into a single deployment described in YAML.

We focus on the end-to-end Data Engineering Lifecycle, from ingestion to visualization. But what does the combination with declarative mean? Think of Functional Data Engineering, which leaves us with confident reproducibility and few side effects (hopefully none), and uses idempotency to restart functions and recover or reinstate a particular state with conviction, or roll back to a specific version.
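To make the functional data engineering point concrete, here is a small hypothetical example of an idempotent transform: it rebuilds one partition deterministically from its inputs, so re-running it after a failure converges to the same state instead of duplicating rows (the table and column names are made up for illustration):

```python
import sqlite3

def rebuild_daily_partition(conn: sqlite3.Connection, day: str) -> None:
    """Idempotent transform: delete and rewrite one partition from its source.

    Running this twice for the same `day` yields exactly the same table state,
    which is what lets a declarative stack restart or roll back with confidence.
    """
    with conn:  # one transaction: the partition is either fully rebuilt or untouched
        conn.execute("DELETE FROM orders_clean WHERE order_day = ?", (day,))
        conn.execute(
            """
            INSERT INTO orders_clean (order_id, order_day, amount)
            SELECT order_id, order_day, amount
            FROM orders_raw
            WHERE order_day = ? AND status IS NOT NULL
            """,
            (day,),
        )

# Purely illustrative in-memory setup.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders_raw (order_id INT, order_day TEXT, amount REAL, status TEXT);
    CREATE TABLE orders_clean (order_id INT, order_day TEXT, amount REAL);
    INSERT INTO orders_raw VALUES (1, '2024-10-17', 9.5, 'ok'), (2, '2024-10-17', 3.0, NULL);
    """
)
rebuild_daily_partition(conn, "2024-10-17")
rebuild_daily_partition(conn, "2024-10-17")  # re-run: still exactly one clean row
print(conn.execute("SELECT COUNT(*) FROM orders_clean").fetchone()[0])  # -> 1
```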

More on The Rise of the Declarative Data Stack.

Other Naming

Tobiko calls it an Integrated Data Stack.

Dagster talks about the Impedance Mismatch and data-asset-oriented orchestration: The Data Engineering Impedance Mismatch | Dagster Blog

# Why Do You Need a Declarative Data Stack?

Why do you need a declarative data stack or an Open Enterprise Data Platform, you might ask?

Data grows more each year than in the entire lifetime before. There is a growing need to make sense of more data. In the old days, you had a single-vendor solution: think of SAP or Oracle. These days, new SaaS and open-source products are created daily, each specializing in a tiny niche. So why would you need another platform?

You want the best of both worlds. You want open source so you are not locked in and can use the strongest, collaboratively created tools built in the open. People worldwide can fix a security bug in minutes, or you can fix it yourself, compared to a large vendor where you rely solely on their update cycle.

The downside of open source is that there are a lot of bugs, missing features, and independent tools. That’s precisely where HelloDATA BE comes into play. We are building the missing platform that combines best-of-breed open-source technologies into a single portal, making it enterprise-ready.

# How to Balance Flexibility with Simplicity

Flexibility is one disadvantage of a DDS, as you may need to implement the interface first before new functions are available in your configs. Therefore, if you require full flexibility, an imperative approach is preferable.

I’d say start somewhere, try it out, get a feel for it, and adjust. If you know you’re going to throw it away, there’s no need for a DDS. However, if you know you are at the enterprise level, you most probably want the simplicity and also the restrictive nature of a DDS. (This came up as a question on LinkedIn.)
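One way to picture that trade-off: a declarative config can only express what the engine’s interface already understands, so a new capability means extending the engine first. A minimal sketch, assuming a hypothetical step registry (none of these names come from a real tool):

```python
# Hypothetical engine: configs may only declare steps the engine has registered.
STEP_REGISTRY = {}

def register_step(name):
    def wrap(fn):
        STEP_REGISTRY[name] = fn
        return fn
    return wrap

@register_step("ingest")
def ingest_step(params):
    print(f"ingesting from {params['source']}")

@register_step("transform")
def transform_step(params):
    print(f"running model {params['model']}")

def run(config):
    for step in config["steps"]:
        fn = STEP_REGISTRY.get(step["use"])
        if fn is None:
            # Full flexibility would require imperative code; here the interface
            # must be extended before the config can declare a new step type.
            raise ValueError(f"unknown step {step['use']!r}: extend the engine first")
        fn(step.get("with", {}))

run({"steps": [
    {"use": "ingest", "with": {"source": "s3://raw/orders"}},
    {"use": "transform", "with": {"model": "orders_cleaned"}},
]})
```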

# Closed-Source DDSEs

Let’s start with closed source first. One key point to note: most of what we’ve discussed here is something that most closed-source tools have implemented in one way or another. Because they’ve built one big monolith, this is relatively straightforward and the natural thing to do.

This can be more challenging and not immediately obvious with an open-source approach and numerous integration tools. Let’s now look at tools that have successfully implemented such features.

- Ascend: The platform automates up to 90% of repetitive data tasks using their DataAware Automation Engine.
- Palantir Foundry: One of the first lakehouse implementations before the term was coined. Enables real-time collaboration between data, analytics, and operational teams through a common logical data lake layer.
- Y42: Mission Control for Your Data Pipelines.

Find more on Closed-Source Data Platforms, and a fantastic read on composable data stacks as a new frontier by Voltron Data.

Usually, the problem with closed-source software is that it is structured as a monolith, combining transformation logic with persisted database tables while keeping the underlying code unknown.

# Open-Source DDSEs

But even more interesting are the open-source tools I found[^1]. They are fantastic and built in the open. Not all of them might be truly declarative data stacks by this definition, but they all build on top of other tools and integrate them declaratively.

- DataForge: Write functional transformation pipelines by leveraging software engineering principles. It focuses on transformation and does not have a visualization tool[^3].
- Starlake: Declarative, text-based tool for data analysts and engineers to extract, load, transform, and orchestrate their data pipelines.
- Dashtool: A lakehouse build tool that builds Iceberg tables from declarative SQL statements and generates Kubernetes workflows to keep these tables up to date. It handles ingestion, transformation, and orchestration. Written in Rust and uses DataFusion.
- BoilingData: A local-first data processing native application designed for rapid data pipeline development. Enables data engineers to build and test pipelines quickly using tools like DuckDB, dbt, and dlt.
- HelloDATA BE: An enterprise data platform built on open-source tools based on the modern data stack. It uses state-of-the-art tools such as dbt for data modeling with SQL, Airflow to run and orchestrate tasks, Superset to visualize BI dashboards, and JupyterHub for data science tasks. It includes multi-tenancy, full authentication, and authorization, all handled through a single web portal.
- SDF: Similar to DataForge, built on Rust and DataFusion. Tries to be the TypeScript for SQL, creating faster development cycles and reliable results with a powerful compiler.
- SQLMesh: An efficient data transformation and modeling framework with compiler capabilities built in via SQLGlot, a Python SQL parser and transpiler.
- GitHub Actions: The simplest version for building a declarative data stack. A deploy.yaml script could be a simple DDS config. GitHub also has an engine that converts it and runs it on Docker runners. So, in a way, it’s another engine implementation, and maybe we could take some configs based on that? (See the sketch after this list.)
- Datacoves: The platform helps enterprises solve data analytics challenges with managed dbt, Airflow, and VS Code, adopting best practices. This approach avoids negotiating multiple SaaS contracts and reduces consulting costs without compromising data security.
- Datadex: Serverless and local-first open data platform.
- BigFunctions: A framework to build a governed catalog of powerful BigQuery functions with a SQL-first approach. Ingesting, advanced data transforms, and serving data in a data app with a single SQL query.
- Hoptimator
- Bruin
- Iasql: Cloud infrastructure as data in PostgreSQL.
- SQL Declarative-Data-Stack
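To give a feel for how thin such an “engine” can be, here is a hedged sketch that treats a deploy.yaml-style file as the entire stack definition and simply runs one command per job. The YAML layout and the commands are invented for illustration (only PyYAML’s yaml.safe_load is a real API), and the echo prefixes keep it runnable without any of the tools installed:

```python
import subprocess
import yaml  # PyYAML

# Invented deploy.yaml-style config: each job is one stage of the stack.
DEPLOY_YAML = """
jobs:
  ingest:
    run: "echo ingest raw orders"
  transform:
    run: "echo dbt build --select orders_cleaned"
  serve:
    run: "echo publish superset dashboard orders_overview"
"""

def run_stack_from_yaml(text: str) -> None:
    # The "engine" is nothing more than: parse the declaration, run each job.
    config = yaml.safe_load(text)
    for name, job in config["jobs"].items():
        print(f"--- {name} ---")
        subprocess.run(job["run"], shell=True, check=True)

run_stack_from_yaml(DEPLOY_YAML)
```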

SQL Compilers

SQLGlot would be a good integration for parsing SQL without running it, similar to SDF’s integration with DataFusion.
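For instance, SQLGlot can parse and transpile a query without executing it, which is exactly the kind of compile-time check a declarative stack could run over its configs (a small example against SQLGlot’s public API; the query itself is made up):

```python
import sqlglot

query = "SELECT order_id, SUM(amount) AS total FROM orders GROUP BY order_id"

# Parse without running: surfaces syntax errors at "compile time".
expression = sqlglot.parse_one(query, read="duckdb")
print(expression.sql(dialect="duckdb", pretty=True))

# Transpile the same declared SQL to another engine's dialect.
print(sqlglot.transpile(query, read="duckdb", write="spark")[0])
```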

Existing Templating
Beyond dedicated tools, templating can solve part of this: Jinja templates, Go’s template package, biGENIUS Template modules, Apache Velocity, Liquid, and many others.
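As a small illustration of the templating route, here is a Jinja2 snippet that renders a parameterized SQL model from declarative values, similar in spirit to how dbt templates its models (the model and variables are invented):

```python
from jinja2 import Template

sql_template = Template(
    """
    SELECT {{ columns | join(', ') }}
    FROM {{ source_table }}
    WHERE order_day >= '{{ start_date }}'
    """
)

print(sql_template.render(
    columns=["order_id", "amount"],
    source_table="lake.orders_raw",
    start_date="2024-10-01",
))
```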

# Alternatives

# What’s the Difference from a Composable Data System?

Composable Data Stacks or Systems sound very similar, as do Multi-Engine Data Stacks. Are they all the same thing, just with different wording?

# Testing

We need deterministic simulation testing for a DDS, like libSQL does for its SQLite rewrite.
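A hedged sketch of what deterministic simulation testing could look like for a DDS: drive a toy pipeline through a seeded schedule of injected failures and assert that the same seed always produces an identical end state (everything here is invented for illustration):

```python
import random

def simulate_stack(seed: int) -> dict:
    """Run a toy ingest -> transform loop under a deterministic failure schedule."""
    rng = random.Random(seed)
    state = {"raw": 0, "clean": 0}
    for day in range(10):
        attempts = 0
        while True:
            attempts += 1
            # Injected, but reproducible, failure: crash before commit and retry.
            if rng.random() < 0.3 and attempts < 3:
                continue
            state["raw"] += 1
            state["clean"] = state["raw"]  # transform rebuilt from raw, not appended
            break
    return state

# Same seed -> identical end state, so any regression is replayable exactly.
assert simulate_stack(seed=42) == simulate_stack(seed=42)
print(simulate_stack(seed=42))
```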

# Declarative Data Stack ENGINE

The engine is an important part, which I go into in more detail in Designing a Declarative Data Stack: From Theory to Practice | ssp.sh. It is similar to how Markdown can be the code, while HackMD, GDocs, and others are engines that run it.

Key distinctions: part-3-example-implementation-declarative-data-stack

Docker and Kubernetes are other engines. There are many more; see part-3-example-implementation-declarative-data-stack.

# Further Reading


Origin: The Rise of the Declarative Data Stack - Rill
Created 2024-10-17
