API and Query Engine: A New Paradigm in Modern Data Engineering

 

The article explores the key roles of the API and the query engine in modern data engineering. The API acts as a single GraphQL endpoint that routes requests to the appropriate data services. The query engine translates generic GraphQL queries into the query language of each specific data source, resolving the differences between data stores and enabling centralised metric definitions. The article also covers the Metrics Layer, the Data Catalog, Orchestration (using Dagster as an example, highlighting its strengths for business logic and code reuse), and the SQL Connector (a universal SQL interface via ODBC/JDBC drivers and Apache Calcite), which together form a more efficient and scalable data engineering stack.

🚀 The API as the single entry point: The article stresses that the API is the single GraphQL endpoint accessed by all tools, acting as a proxy, router, or gateway that forwards queries to the correct service. This design simplifies access to multiple data sources and enables more centralised data management and access control.

💡 The query engine enables generality: The query engine is the critical intermediate layer that translates generic GraphQL queries into the query language of a specific data source (such as SQL). A metric (such as distinct_users) can thus be defined once and applied across all systems, avoiding long and complex queries for each data store and improving efficiency and consistency.

⚙️ Orchestration and code reuse: The article recommends a tool such as Dagster for orchestrating business logic and data transformations. Dagster expresses logic in testable, scalable Python code and uses concepts such as Resources and Ops to promote code reuse and modular pipelines, for example connecting to a database, creating a Delta table, or launching a Spark job.

🔗 The SQL Connector provides a universal SQL interface: To stay compatible with the SQL ecosystem, the SQL Connector provides an ODBC or JDBC driver and builds on Apache Calcite, so that BI and SQL tools can seamlessly access the data and understand the centralised metrics and dimensions without additional adaptation work.

Let’s now look into each component in more detail and what they effectively do.

# API and Query Engine

The first component of the Analytics API is the interface and Query Engine. This interface is the single GraphQL endpoint that all tools access. Call it a proxy, router, or gateway: it forwards every query, mutation, or subscription to the correct service or pipeline.
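To make this concrete, here is a minimal sketch of such a gateway in Python using Ariadne; the schema, field names, and the two backend helpers (`run_metric_query`, `launch_pipeline`) are hypothetical placeholders, not from the article:

```python
# Sketch of a GraphQL gateway: one endpoint, fields routed to backend services.
# Requires `pip install ariadne`; schema and backend helpers are hypothetical.
from ariadne import MutationType, QueryType, gql, make_executable_schema
from ariadne.asgi import GraphQL

type_defs = gql("""
    type Query { metric(name: String!, by: String): String! }
    type Mutation { triggerPipeline(name: String!): String! }
""")

def run_metric_query(name, by):    # placeholder: forward to the Query Engine
    return f"result of {name} by {by}"

def launch_pipeline(name):         # placeholder: forward to the orchestrator
    return f"launched {name}"

query, mutation = QueryType(), MutationType()

@query.field("metric")
def resolve_metric(_, info, name, by=None):
    return run_metric_query(name, by)

@mutation.field("triggerPipeline")
def resolve_trigger(_, info, name):
    return launch_pipeline(name)

# Serve with any ASGI server, e.g. `uvicorn gateway:app`
app = GraphQL(make_executable_schema(type_defs, query, mutation))
```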

The query engine helps when you have centrally calculated measures or data stores that do not speak SQL: it translates the GraphQL query into the query language of that specific source. A critical separation from the SQL Connector is that the query engine uses more advanced and general patterns to query data. E.g., instead of `SELECT COUNT(DISTINCT userid) AS distinct_users FROM customers`, we would write the more generalised:

```sql
SELECT {{ metrics.distinct_users }} FROM {{ datasources.customers }}
-- or
SELECT * FROM {{ metrics.metric('customer', by='day', dims=['country', 'year']) }}
```

For that, we need an intermediate layer that translates the generic query into an actual SQL query: the Query Engine.
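As a rough illustration of that translation step, the engine can resolve the `{{ ... }}` placeholders against a central registry before the query ever reaches a database. A minimal sketch with Jinja2; the registry contents are invented for the example:

```python
# Sketch of a tiny query engine: resolve {{ metrics.* }} and {{ datasources.* }}
# placeholders against a central registry, yielding a concrete SQL query.
# Registry contents are invented for the example.
from jinja2 import Environment

METRICS = {"distinct_users": "COUNT(DISTINCT userid) AS distinct_users"}
DATASOURCES = {"customers": "analytics.customers"}

def translate(generic_query: str) -> str:
    template = Environment().from_string(generic_query)
    # Jinja resolves `metrics.distinct_users` via dict lookup.
    return template.render(metrics=METRICS, datasources=DATASOURCES)

print(translate("SELECT {{ metrics.distinct_users }} FROM {{ datasources.customers }}"))
# -> SELECT COUNT(DISTINCT userid) AS distinct_users FROM analytics.customers
```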

I hope you notice the benefits and the small revolution for all business intelligence engineers here. We have one definition instead of long and complex queries for each data store, each with slightly different syntax. And rather than defining a metric such as distinctUsers in various places, we store it once and apply it to all systems. No need to worry whether you have the latest version or whether someone changed the calculation. More on how to store one metric definition centrally in the next chapter.

> We're seeing more abstractions emerging in the transform layer. The metrics layer (popularised by Airbnb's Minerva, Transform.co, and Metriql), feature engineering frameworks (closer to MLOps), A/B testing frameworks, and a Cambrian explosion of homegrown computation frameworks of all shapes and flavours. Call this "data middleware", "parametric pipelining" or "computation framework", but this area is starting to take shape.

From How the Modern Data Stack is Reshaping Data Engineering

As seen in the Analytics API image above, it integrates with the other components through GraphQL, either to read data from the metrics and data catalog store or to trigger an update through the orchestration. There is no single tool that implements all of this, apart from the Headless BI tools, which implement only certain parts. You can find more about them in The Recent Hype Around Headless BI chapter.

# Metrics Layer

See Metrics Layer

# Data Catalog

See Data Catalog

# Orchestration

The orchestration part is where most of the business logic and transformation ends up. Instead of building everything into the Query Engine directly on top of GraphQL, it’s better to use a proper tool so you can re-use code and integrate it better.

I see Dagster as a modern business rule engine where you express the logic in Python code, which makes it testable and scalable compared to no-code/low-code approaches. Dagster offers building blocks such as Resources to capture re-usable code, e.g. connecting to Druid, creating a Delta table, or starting a Spark job, all of which are used in the pipelines. Another building block in the Analytics API is an Op, which condenses your business logic into functional tasks within a data pipeline. It is well defined with typed inputs and outputs and uses context such as the above Resources, making it easy to run a Spark job as part of an Op.
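A minimal sketch of how a Resource and an Op could fit together, using Dagster's standard @resource/@op/@job decorators; the Druid connection details and the metric logic are placeholders:

```python
# Sketch of Dagster building blocks: a reusable Resource plus a typed Op.
# Connection details and metric logic are hypothetical placeholders.
from dagster import job, op, resource

@resource(config_schema={"host": str, "port": int})
def druid_connection(init_context):
    # A real Resource would return a live client (e.g. via pydruid).
    cfg = init_context.resource_config
    return f"druid://{cfg['host']}:{cfg['port']}"  # placeholder handle

@op(required_resource_keys={"druid"})
def distinct_users(context) -> int:
    # Placeholder: run COUNT(DISTINCT userid) against Druid here.
    context.log.info(f"querying {context.resources.druid}")
    return 42

@job(resource_defs={"druid": druid_connection})
def metrics_pipeline():
    distinct_users()

# Run locally:
# metrics_pipeline.execute_in_process(
#     run_config={"resources": {"druid": {"config": {"host": "localhost", "port": 8082}}}}
# )
```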

The integration within the Analytics API happens over GraphQL, as Dagster has a GraphQL API built in. Dagster uses this interface to query all sorts of metadata, start pipelines/sensors (mutations), or subscribe to specific info. Side note: this does not come out of thin air, as Nick Schrock, the founder of Dagster, is a co-creator of GraphQL 😉. Instead of running and using the Dagster UI, we use that interface for developers and abstract it away behind the Analytics API.
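In practice, triggering a run through that interface is a plain HTTP POST against Dagster's GraphQL endpoint. A sketch with requests; the mutation shape follows Dagster's 0.x-era launchPipelineExecution API and may differ across versions, and the URL and repository names are hypothetical:

```python
# Sketch: launch a Dagster run through its GraphQL endpoint.
# Mutation shape follows Dagster's 0.x-era API and may differ per version;
# the URL, repository, and job names are hypothetical.
import requests

LAUNCH_MUTATION = """
mutation LaunchRun($params: ExecutionParams!) {
  launchPipelineExecution(executionParams: $params) {
    __typename
  }
}
"""

variables = {
    "params": {
        "selector": {
            "repositoryLocationName": "analytics_api",
            "repositoryName": "metrics_repo",
            "pipelineName": "metrics_pipeline",
        },
        "mode": "default",
    }
}

resp = requests.post(
    "http://localhost:3000/graphql",
    json={"query": LAUNCH_MUTATION, "variables": variables},
)
print(resp.json())
```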

See more on Data Orchestrators.

# SQL Connector

SQL is the data language besides Python, as elaborated in earlier articles. That’s why we need to provide an interface for it as well. The SQL Connector integrates all BI, SQL-speaking, or legacy tools. The connector mainly implements an ODBC or JDBC driver, for example with Avatica, which is built on Apache Calcite and used by Apache Druid. With that, we have a way to interface with ANSI SQL, including all our metrics and dimensions from the metrics store, with no additional effort on the accessing side as long as the tools speak SQL.
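From the consuming side this is invisible: any ODBC-capable tool or script simply runs SQL. A small sketch with pyodbc, where the DSN and the exposed schema and metric column are hypothetical:

```python
# Sketch: a SQL tool (or script) querying centralised metrics via ODBC.
# The DSN and the exposed schema/column names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=analytics_api")  # driver provided by the SQL Connector
cursor = conn.cursor()

# The metric is defined once in the metrics store; here it is just a column.
cursor.execute("SELECT country, distinct_users FROM metrics.customers")
for country, users in cursor.fetchall():
    print(country, users)
```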


Origin: Building an Analytics API with GraphQL: The Next Level of Data Engineering? | ssp.sh
Created 2022-02-19
