Modern Data Stack: An Open-Source Toolset for Enabling Data Insights

The Modern Data Stack (MDS) comprises a suite of open-source tools designed for end-to-end analytics. This includes data ingestion, transformation, machine learning, and integration into a columnar data warehouse or lake solution, all complemented by an analytics BI dashboard backend. The stack’s versatility allows extensions for data quality, data cataloging, and more.

MDS aims to enable data insights using the best-suited tools for each process. It’s worth noting that “Modern Data Stack” is a relatively new term, with its definition still evolving.

Synonym Names
A newer term, ngods (new generation open-source data stack), has emerged. I have previously referred to this concept as the Open Data Stack Project; Open Data Stack is my own name for it. Additionally, Dagster introduced the term DataStack 2.0 in a blog post.

Closed Source vs Open Source
Closed Source examples: dbt Cloud, Looker, Snowflake, Fivetran, Hightouch, Census
Open Source alternatives: Airbyte, dbt, Dagster, Superset, Reverse-ETL?

Modern Data Stack on a Laptop
DuckDB: Modern Data Stack in a Box

# Why the Modern Data Stack?

A perspective from Reddit highlights the shift in data warehousing and analytics. It underscores the reduced need for extensive teams and infrastructure, thanks to new tools that streamline data management and reporting. Particularly for small and mid-sized companies, MDS offers a competitive edge in data handling, allowing even a single data engineer to manage vast datasets efficiently.

A notable article discussing Lakehouse, Metrics Layer, and Clickhouse:
The Next Cloud Data Platform | Greylock

# Integrating with Dagster

The downside of MDS is the unbundling (see Bundling vs Unbundling: Monolith Data vs Microservices), but Dagster helps integrate the full data stack together.

Explore more about its power with Dagster and Data Assets.

I often ponder over the ideal tools for a data stack. My preference leans toward a Cloud Data Warehouse such as Firebolt, Snowflake, BigQuery, Redshift, or Synapse, as a starting point.

The journey typically begins with Airbyte for data integration, followed by SQL-based transformation with dbt. Orchestrating these steps in Python with a tool like Dagster is crucial.
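To make the ingest, transform, orchestrate sequence more concrete, here is a minimal sketch in plain Python. It does not use Airbyte, dbt, or Dagster, and it is not their API: the `asset` decorator, the `materialize_all` helper, and the asset names (`raw_orders`, `clean_orders`, `daily_revenue`) are all hypothetical stand-ins. The sketch only illustrates the general idea of declaring data assets together with their upstream dependencies and letting an orchestrator derive the run order from them.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical registry: asset name -> (upstream asset names, compute function)
ASSETS = {}


def asset(*deps):
    """Register a function as a data asset depending on `deps` (illustrative only)."""
    def register(fn):
        ASSETS[fn.__name__] = (deps, fn)
        return fn
    return register


@asset()
def raw_orders():
    # Stand-in for an ingestion step (the role Airbyte plays in the stack).
    return [
        {"day": "2022-03-24", "amount": 10.0},
        {"day": "2022-03-24", "amount": -1.0},  # invalid row, dropped downstream
        {"day": "2022-03-25", "amount": 5.0},
    ]


@asset("raw_orders")
def clean_orders(raw_orders):
    # Stand-in for a transformation step (the role dbt plays in the stack).
    return [row for row in raw_orders if row["amount"] > 0]


@asset("clean_orders")
def daily_revenue(clean_orders):
    # Aggregate cleaned rows into one total per day.
    totals = {}
    for row in clean_orders:
        totals[row["day"]] = totals.get(row["day"], 0.0) + row["amount"]
    return totals


def materialize_all():
    """Compute every registered asset once, upstream assets first."""
    graph = {name: set(deps) for name, (deps, _) in ASSETS.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        deps, fn = ASSETS[name]
        results[name] = fn(*(results[d] for d in deps))
    return results


if __name__ == "__main__":
    print(materialize_all()["daily_revenue"])
```

The point of the design is that nothing states the execution order explicitly; it falls out of the declared dependencies, so adding a new downstream asset does not require touching a central pipeline definition.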

From there, I would integrate additional open-source tools based on specific needs: Spark for processing, Delta Lake for data lake formatting and ACID Transactions, Amundsen for data cataloging, and Great Expectations for data quality, among others. For smaller projects, DuckDB is suitable for local OLAP scenarios, while Kubernetes and DevOps provide scalability.

For teams without data engineering resources, closed-source options like Ascend or Foundry are viable alternatives.

Feel free to reach out for further discussion or clarifications.

# Modern Data Stack Alternatives

Some argue that PRQL as a DuckDB extension, combined with Dagster alone, could replace the whole MDS:

I’m with you at least 2/3 of the way. My preferred stack is PRQL + DuckDB + Dagster. I evaluated the space for work at my current company (was originally only DE, handling ingests from ~300 sources across various systems, on order of ~1k downstream tables in dbt + hundreds of dashboards + a handful of business-critical data+app features; now leading a small team).
I came away ranking dagster first, prefect second, everything else not close. IMO dagster wins fundamentally for data engineers bc it picks the right core abstraction (software defined assets) and builds everything else around that. Prefect for me is best for general non-data-specific orchestration as a nearly transparent layer around existing scripts.
Source: PRQL as a DuckDB Extension | Hacker News

# Other Data Stacks

# Further Links

Origin: Ask Simon | astorik
References: Cloud Data Warehouses, Data Orchestrators, Dagster, Amundsen, Data Catalog, Python, dbt, Superset, Metrics Layer, Kubernetes, Use closed-source if you don’t have the developers or the time, Ascend.io, Palantir Foundry, Wiki
Created 2022-03-25
