Second Brain: Crafted, Curated, Connected, Compounded on October 2, 21:28
The Evolution of Data Modeling and Modern Data Engineering

 

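As data engineering approaches and tools advance, data modeling is undergoing profound change. Ralph Kimball's The Data Warehouse Toolkit laid the foundation, but changes to the ETL process now require us to rethink modeling strategy. Faster database engines, for example, deliver performance gains that once required heavy denormalization. Surrogate keys are tricky to maintain and less human-friendly than business keys, cheap cloud storage makes it easier to generate database schemas dynamically, and systematically snapshotting dimensions simplifies change tracking. At the same time, growing collaboration calls for a more flexible approach to conformed dimensions. And as business users become more data-literate, real-time processing and ad hoc complex queries matter more, with technologies such as Spark providing strong support. All of this prompts a fresh look at how well the Kimball approach fits the current data environment.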

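💡 **Modernization of data modeling**: The rise of modern data engineering approaches and tools, reflected in the fast-moving MAD (Machine Learning, Artificial Intelligence & Data) landscape, is reshaping traditional data modeling. Changes to the ETL process require modeling strategies to adjust to new technical and business needs.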

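🔧 **Simpler change tracking and schema management**: Compared with handling complex Slowly Changing Dimensions (SCD Type 2), systematically snapshotting dimensions, or denormalizing dimension attributes directly onto the fact table, is a simpler and more intuitive way to track changes, especially when important information must be preserved as of the moment of the transaction. The spread of cloud storage also makes it possible to create and evolve database schemas dynamically, reducing the reliance on hand-written DDL statements.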
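
A minimal Python sketch of dimension snapshotting, assuming a hypothetical in-memory store in place of a real warehouse. Each run stores a full copy of the dimension under its snapshot date, and a point-in-time lookup reads the latest snapshot on or before the requested date. The names `customer_snapshots`, `snapshot_dimension`, and `lookup` are illustrative and do not come from the original note.

```python
from datetime import date

# Hypothetical in-memory "warehouse": one full copy of the customer
# dimension per snapshot date (in practice, roughly one partition per day).
customer_snapshots: dict[date, dict[int, dict[str, str]]] = {}


def snapshot_dimension(as_of: date, current_rows: dict[int, dict[str, str]]) -> None:
    """Store a full, independent copy of the dimension under its snapshot date."""
    customer_snapshots[as_of] = {key: dict(row) for key, row in current_rows.items()}


def lookup(customer_id: int, as_of: date) -> dict[str, str]:
    """Return the attributes from the latest snapshot on or before as_of."""
    eligible = [d for d in customer_snapshots if d <= as_of]
    if not eligible:
        raise KeyError(f"no snapshot on or before {as_of}")
    return customer_snapshots[max(eligible)][customer_id]


# Current state of the dimension in the (assumed) source system.
source = {42: {"name": "Acme", "segment": "SMB"}}
snapshot_dimension(date(2023, 10, 1), source)

source[42]["segment"] = "Enterprise"  # the attribute changes at the source
snapshot_dimension(date(2023, 10, 2), source)

print(lookup(42, date(2023, 10, 1))["segment"])  # SMB
print(lookup(42, date(2023, 10, 2))["segment"])  # Enterprise
```

The trade-off in this sketch is storage for simplicity: every snapshot repeats unchanged rows, but there are no effective-date ranges or current-row flags to maintain, which is the bookkeeping an SCD Type 2 load would need.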

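🤝 **Balancing conformance and collaboration**: Conformed dimensions, and data conformance in general, remain critical in modern data warehouses and data environments. To let teams collaborate and work on shared objects, however, moderately relaxing strict conformance is a necessary trade-off that improves efficiency and flexibility.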

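🚀 **Embracing real-time and ad hoc queries**: As more business users gain data analysis skills, data processing is shifting from batch toward real time. Newer technologies such as Spark can run complex jobs on demand and support ad hoc queries, meeting the growing business need for real-time insight, in contrast to the traditional model of precomputation and batch processing.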

Since Ralph Kimball wrote the definitive book on data modeling, The Data Warehouse Toolkit, data modeling has been changing.

Especially with newer data engineering approaches and tools, the landscape has changed drastically (see The 2023 MAD (Machine Learning, Artificial Intelligence & Data) Landscape).

Essentially, you can’t change ETL without modeling differently. Here are a few points that have changed and will change further:

- Further denormalization for performance gains is mostly made unnecessary by faster database engines and cloud solutions.
- Maintaining surrogate keys in dimensions can be tricky, and they are not human-friendly; we prefer business keys.
- With the popularity of document storage and cheap blobs in cloud storage, it is becoming easier to create and evolve database schemas dynamically without writing DDL statements.
- Systematically snapshotting dimensions, rather than handling complex and sometimes counterintuitive Slowly Changing Dimensions (Type 2), is a way to simplify change tracking in a DWH. It is also easy and relatively cheap to denormalize dimension attributes directly onto the fact table to keep important information as of the moment of the transaction.
- Conformed dimensions, and conformance in general, are extremely important in today's data warehouses and data environments. But to be more collaborative and work on the same objects, loosening conformance is a necessary trade-off.
- Not only are more people working on the same data warehousing project, but more people from business and other departments are becoming more data-savvy than ever before. In that sense, data needs to become more real-time rather than relying on batch processing and precomputed calculations. This can be done more ad hoc with fast technologies like Spark, which run complex jobs on demand.
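
The point about denormalizing dimension attributes onto the fact table can be sketched in Python as follows. This is only an illustration under stated assumptions: `SaleFact`, `customer_dim`, and `record_sale` are hypothetical names, and the dimension here keeps current state only and is overwritten in place.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SaleFact:
    sold_at: datetime
    customer_id: int        # business key, still usable for joins
    amount: float
    customer_segment: str   # copied from the dimension at transaction time


# Hypothetical current-state customer dimension, keyed by business key.
customer_dim = {42: {"name": "Acme", "segment": "SMB"}}


def record_sale(customer_id: int, amount: float, sold_at: datetime) -> SaleFact:
    """Build a fact row, copying the customer's segment as it is right now."""
    segment = customer_dim[customer_id]["segment"]
    return SaleFact(sold_at, customer_id, amount, segment)


first = record_sale(42, 100.0, datetime(2023, 10, 1, 9, 0))
customer_dim[42]["segment"] = "Enterprise"  # dimension overwritten in place
second = record_sale(42, 250.0, datetime(2023, 10, 2, 9, 0))

print(first.customer_segment)   # SMB
print(second.customer_segment)  # Enterprise
```

Because the segment is stored on each fact row, the first sale still reports the value that was true when it happened, even though the dimension itself carries no history.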

See more in Babies and Bathwater: Is Kimball Still Relevant?


References: Education is changing, ETL is changing
