Second Brain: Crafted, Curated, Connected, Compounded on October 2, 21:18
The Modern Data Stack: An Open-Source Toolset That Powers Data Insights

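The Modern Data Stack (MDS) is a collection of open-source tools for end-to-end data analytics. It covers data ingestion, transformation, machine learning, and integration into a columnar data warehouse or lake solution, backed by a BI dashboard backend. Its flexibility allows extensions into areas such as data quality and data cataloging, and it aims to deliver data insights by choosing the best-suited tool for each step. The term is relatively new and its definition is still evolving; synonymous or related concepts such as ngods (new generation open-source data stack) and DataStack 2.0 have also been proposed. Unlike closed-source solutions, MDS emphasizes openness and freedom of tool choice. It can reduce the complexity of data management and reporting and, for small and mid-sized companies in particular, make data handling considerably more efficient.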

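💡 **The core idea of the Modern Data Stack (MDS)**: MDS is an integrated set of open-source tools covering the whole analytics lifecycle: data ingestion, transformation, machine learning, and finally integration into a data warehouse or data lake, with BI dashboard support. Its main strength is offering the best-suited tool for each step of the analysis, which leads to efficient data insights. Unlike traditional bundled solutions, MDS encourages "unbundled" use, so users can combine tools flexibly according to their needs.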

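🛠️ **Advantages and choice of open-source tools**: At its core, MDS embraces open source, offering options such as Airbyte (data integration), dbt (SQL transformation), Dagster (orchestration), and DuckDB (local OLAP) as alternatives to closed-source products such as Snowflake, Fivetran, and Looker. This freedom of choice lets users build the stack that best fits their needs based on cost, features, and community support. The article also mentions PRQL as a DuckDB extension and Dagster's strength in orchestration; some even argue that together they could replace the entire MDS.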

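🚀 **The value of MDS in practice**: MDS can significantly lower the barrier to data management and analytics. This matters most for small and mid-sized companies with limited resources, where even a single data engineer can handle large datasets efficiently. A sensible combination of tools reduces the dependence on large teams and infrastructure. For example, using Airbyte for data integration, dbt for SQL transformation, and Dagster for orchestration, and adding tools such as Spark, Delta Lake, Amundsen, and Great Expectations as needed, yields a powerful and flexible data platform. For teams without data engineering resources, closed-source options such as Ascend or Foundry are also viable.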

The Modern Data Stack (MDS) comprises a suite of open-source tools designed for end-to-end analytics. This includes data ingestion, transformation, machine learning, and integration into a columnar data warehouse or lake solution, all complemented by an analytics BI dashboard backend. The stack’s versatility allows extensions for data quality, data cataloging, and more.

MDS aims to enable data insights using the best-suited tools for each process. It’s worth noting that “Modern Data Stack” is a relatively new term, with its definition still evolving.

Synonym Names

A newer term, ngods (new generation open-source data stack), has emerged. Dagster also introduced the term DataStack 2.0 in a recent blog post. My own name for this concept is Open Data Stack; I previously referred to it as the Open Data Stack Project.

Closed Source vs Open Source

Closed Source examples: dbt Cloud, Looker, Snowflake, Fivetran, Hightouch, Census
Open Source alternatives: Airbyte, dbt, Dagster, Superset, Reverse-ETL?

Modern Data Stack on a Laptop

DuckDB: Modern Data Stack in a Box

# Why the Modern Data Stack?

A perspective from Reddit highlights the shift in data warehousing and analytics. It underscores the reduced need for extensive teams and infrastructure, thanks to new tools that streamline data management and reporting. Particularly for small and mid-sized companies, MDS offers a competitive edge in data handling, allowing even a single data engineer to manage vast datasets efficiently.

A notable article discussing Lakehouse, Metrics Layer, and Clickhouse:
The Next Cloud Data Platform | Greylock

# Integrating with Dagster

The downside of MDS is the unbundling (see Bundling vs Unbundling: Monolith Data vs Microservices), but Dagster helps tie the full data stack together.

Explore more about its power with Dagster and Data Assets.

I often ponder over the ideal tools for a data stack. My preference leans toward a Cloud Data Warehouse such as Firebolt, Snowflake, BigQuery, Redshift, or Synapse, as a starting point.

The journey typically begins with Airbyte for data integration, followed by SQL-based transformation with dbt. Orchestrating these processes in Python with a tool like Dagster is crucial.
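To make the order of these three steps concrete, here is a minimal sketch that uses only the Python standard library. It is not how Airbyte, dbt, or Dagster are actually invoked: `sqlite3` stands in for the warehouse, and the names `RAW_ORDERS`, `ingest`, `transform`, and `run_pipeline` are hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical source rows; in the stack described above, Airbyte would land these.
RAW_ORDERS = [
    (1, "alice", 30.0),
    (2, "bob", 12.5),
    (3, "alice", 7.5),
]


def ingest(conn: sqlite3.Connection) -> None:
    """Load step (stand-in for Airbyte): copy raw rows into the warehouse unchanged."""
    conn.execute("CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)


def transform(conn: sqlite3.Connection) -> None:
    """Transform step (stand-in for a dbt model): derive a table with plain SQL."""
    conn.execute(
        """
        CREATE TABLE customer_revenue AS
        SELECT customer, SUM(amount) AS revenue
        FROM raw_orders
        GROUP BY customer
        """
    )


def run_pipeline() -> list[tuple[str, float]]:
    """Orchestration step (stand-in for Dagster): run the steps in dependency order."""
    conn = sqlite3.connect(":memory:")
    try:
        for step in (ingest, transform):
            step(conn)
        return conn.execute(
            "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_pipeline())
```

The point of the sketch is the separation of concerns: raw data lands untouched, the transformation is expressed in SQL inside the warehouse, and a thin Python layer decides what runs when.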

From there, I would integrate additional open-source tools based on specific needs: Spark for processing, Delta Lake for data lake formatting and ACID transactions, Amundsen for data cataloging, and Great Expectations for data quality, among others. For smaller projects, DuckDB is suitable for local OLAP scenarios, while Kubernetes and DevOps provide scalability.

For teams without data engineering resources, closed-source options like Ascend or Foundry are viable alternatives.

Feel free to reach out for further discussion or clarifications.

# Modern Data Stack Alternatives

Some argue that PRQL (as a DuckDB extension) plus Dagster alone could replace the whole MDS:

I’m with you at least 2/3 of the way. My preferred stack is PRQL + DuckDB + Dagster. I evaluated the space for work at my current company (was originally only DE, handling ingests from ~300 sources across various systems, on order of ~1k downstream tables in dbt + hundreds of dashboards + a handful of business-critical data+app features; now leading a small team).

I came away ranking dagster first, prefect second, everything else not close. IMO dagster wins fundamentally for data engineers bc it picks the right core abstraction (software defined assets) and builds everything else around that. Prefect for me is best for general non-data-specific orchestration as a nearly transparent layer around existing scripts.

Source: PRQL as a DuckDB Extension | Hacker News
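To illustrate what "software defined assets" means as a core abstraction, here is a minimal standard-library sketch. It is not Dagster's API: the `asset` decorator, the `ASSETS` registry, and `materialize_all` are hypothetical names. The sketch only loosely mirrors the idea that each asset is a function and that its upstream dependencies are read from its parameter names.

```python
import inspect
from graphlib import TopologicalSorter
from typing import Any, Callable

# Registry of asset name -> function that computes it (hypothetical, not Dagster's API).
ASSETS: dict[str, Callable[..., Any]] = {}


def asset(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function as an asset; its parameter names are its upstream assets."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_events() -> list[dict]:
    # Stand-in for an ingested table.
    return [
        {"user": "a", "clicks": 3},
        {"user": "b", "clicks": 5},
        {"user": "a", "clicks": 1},
    ]


@asset
def clicks_per_user(raw_events: list[dict]) -> dict[str, int]:
    # Aggregate clicks per user from the upstream asset.
    totals: dict[str, int] = {}
    for row in raw_events:
        totals[row["user"]] = totals.get(row["user"], 0) + row["clicks"]
    return totals


@asset
def top_user(clicks_per_user: dict[str, int]) -> str:
    # Pick the user with the highest click total.
    return max(clicks_per_user, key=clicks_per_user.get)


def materialize_all() -> dict[str, Any]:
    """Compute every registered asset once, upstream assets first."""
    # Map each asset to the set of assets it depends on (its parameter names).
    graph = {name: set(inspect.signature(fn).parameters) for name, fn in ASSETS.items()}
    results: dict[str, Any] = {}
    for name in TopologicalSorter(graph).static_order():
        kwargs = {dep: results[dep] for dep in graph[name]}
        results[name] = ASSETS[name](**kwargs)
    return results
```

Calling `materialize_all()` computes `raw_events`, then `clicks_per_user`, then `top_user`. The pipeline is declared in terms of the data it produces rather than the tasks it runs, and the execution order falls out of the dependency graph.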

# Other Data Stacks


Origin: Ask Simon | astorik
References: Cloud Data Warehouses Data Orchestrators Dagster Amundsen Data Catalog Python dbt Superset Metrics Layer Kubernetes Use closed-source if you don’t have the developers or the time Ascend.io Palantir Foundry Wiki
Created 2022-03-25
