API and Query Engine: A New Paradigm in Modern Data Engineering

 

The article explores the key roles of the API and the query engine in modern data engineering. The API acts as a single GraphQL endpoint that routes requests to the appropriate data services. The query engine translates generic GraphQL queries into the query language of each specific data source, resolving the differences between data stores and enabling centralised metric definitions. The article also covers the Metrics Layer, the Data Catalog, Orchestration (using Dagster as an example, highlighting its strengths for business logic and code reuse), and the SQL Connector (a universal SQL interface via ODBC/JDBC drivers and Apache Calcite), which together form a more efficient and scalable data engineering stack.

🚀 The API as the single entry point: The article stresses that the API is the single GraphQL endpoint accessed by all tools, acting as a proxy, router, or gateway that forwards queries to the correct service. This design simplifies access to multiple data sources and enables more centralised data management and access control.

💡 The query engine enables generality: The query engine is the critical intermediate layer that translates generic GraphQL queries into the query language of a specific data source (such as SQL). A metric (such as distinct_users) can thus be defined once and applied across all systems, avoiding long and complex queries for each data store and improving efficiency and consistency.

⚙️ Orchestration and code reuse: The article recommends a tool such as Dagster for orchestrating business logic and data transformations. Dagster expresses logic in testable, scalable Python code and uses concepts such as Resources and Ops to promote code reuse and modular pipelines, for example connecting to a database, creating a Delta table, or launching a Spark job.

🔗 The SQL Connector provides a universal SQL interface: To stay compatible with the SQL ecosystem, the SQL Connector provides an ODBC or JDBC driver and builds on Apache Calcite, so that BI and SQL tools can seamlessly access the data and understand the centralised metrics and dimensions without additional adaptation work.

Let’s now look into each component in more detail and what they effectively do.

# API and Query Engine

The first component of the Analytics API is the interface and Query Engine. This interface is the single GraphQL endpoint that all tools access. Call it a proxy, router, or gateway: it forwards every query, mutation, or subscription to the correct service or pipeline.
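To make this concrete, here is a minimal sketch of such a gateway in Python using Ariadne; the schema, field names, and the two backend helpers (`run_metric_query`, `launch_pipeline`) are hypothetical placeholders, not from the article:

```python
# Sketch of a GraphQL gateway: one endpoint, fields routed to backend services.
# Requires `pip install ariadne`; schema and backend helpers are hypothetical.
from ariadne import MutationType, QueryType, gql, make_executable_schema
from ariadne.asgi import GraphQL

type_defs = gql("""
    type Query { metric(name: String!, by: String): String! }
    type Mutation { triggerPipeline(name: String!): String! }
""")

def run_metric_query(name, by):    # placeholder: forward to the Query Engine
    return f"result of {name} by {by}"

def launch_pipeline(name):         # placeholder: forward to the orchestrator
    return f"launched {name}"

query, mutation = QueryType(), MutationType()

@query.field("metric")
def resolve_metric(_, info, name, by=None):
    return run_metric_query(name, by)

@mutation.field("triggerPipeline")
def resolve_trigger(_, info, name):
    return launch_pipeline(name)

# Serve with any ASGI server, e.g. `uvicorn gateway:app`
app = GraphQL(make_executable_schema(type_defs, query, mutation))
```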

The query engine helps when you have centrally calculated measures or data stores that do not speak SQL: it translates the GraphQL query into the query language of that specific source. A critical separation from the SQL Connector is that the query engine uses more advanced and general patterns to query data. E.g., instead of `SELECT COUNT(DISTINCT userid) AS distinct_users FROM customers`, we would write the more generalised:

```sql
SELECT {{ metrics.distinct_users }} FROM {{ datasources.customers }}
-- or
SELECT * FROM {{ metrics.metric('customer', by='day', dims=['country', 'year']) }}
```

For that, we need an intermediate layer that translates the generic query into an actual SQL query: the Query Engine.
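As a rough illustration of that translation step, the engine can resolve the `{{ ... }}` placeholders against a central registry before the query ever reaches a database. A minimal sketch with Jinja2; the registry contents are invented for the example:

```python
# Sketch of a tiny query engine: resolve {{ metrics.* }} and {{ datasources.* }}
# placeholders against a central registry, yielding a concrete SQL query.
# Registry contents are invented for the example.
from jinja2 import Environment

METRICS = {"distinct_users": "COUNT(DISTINCT userid) AS distinct_users"}
DATASOURCES = {"customers": "analytics.customers"}

def translate(generic_query: str) -> str:
    template = Environment().from_string(generic_query)
    # Jinja resolves `metrics.distinct_users` via dict lookup.
    return template.render(metrics=METRICS, datasources=DATASOURCES)

print(translate("SELECT {{ metrics.distinct_users }} FROM {{ datasources.customers }}"))
# -> SELECT COUNT(DISTINCT userid) AS distinct_users FROM analytics.customers
```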

I hope you notice the benefits and the small revolution for all business intelligence engineers here. We have one definition instead of long and complex queries for each data store, each with slightly different syntax. And rather than defining a metric such as distinctUsers in various places, we store it once and apply it to all systems. No need to worry whether you have the latest version or whether someone changed the calculation. More on how to store one metric definition centrally in the next chapter.

> We're seeing more abstractions emerging in the transform layer. The metrics layer (popularised by Airbnb's Minerva, Transform.co, and Metriql), feature engineering frameworks (closer to MLOps), A/B testing frameworks, and a Cambrian explosion of homegrown computation frameworks of all shapes and flavours. Call this "data middleware", "parametric pipelining" or "computation framework", but this area is starting to take shape.

From How the Modern Data Stack is Reshaping Data Engineering

As seen in the Analytics API image above, it integrates with the other components through GraphQL, either to read data from the metrics and data catalog store or to trigger an update through the orchestration. There is no single tool that implements all of this, apart from the Headless BI tools, which implement only certain parts. You can find more about them in The Recent Hype Around Headless BI chapter.

# Metrics Layer

See Metrics Layer

# Data Catalog

See Data Catalog

# Orchestration

The orchestration part is where most of the business logic and transformation ends up. Instead of building everything into the Query Engine directly on top of GraphQL, it’s better to use a proper tool so you can re-use code and integrate it better.

I see Dagster as a modern business rule engine where you express the logic in Python code, which makes it testable and scalable compared to no-code/low-code approaches. Dagster offers building blocks such as Resources to capture re-usable code, e.g. connecting to Druid, creating a Delta table, or starting a Spark job, all of which are used in the pipelines. Another building block in the Analytics API is an Op, which condenses your business logic into functional tasks within a data pipeline. It is well defined with typed inputs and outputs and uses context such as the above Resources, making it easy to run a Spark job as part of an Op.
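A minimal sketch of how a Resource and an Op could fit together, using Dagster's standard @resource/@op/@job decorators; the Druid connection details and the metric logic are placeholders:

```python
# Sketch of Dagster building blocks: a reusable Resource plus a typed Op.
# Connection details and metric logic are hypothetical placeholders.
from dagster import job, op, resource

@resource(config_schema={"host": str, "port": int})
def druid_connection(init_context):
    # A real Resource would return a live client (e.g. via pydruid).
    cfg = init_context.resource_config
    return f"druid://{cfg['host']}:{cfg['port']}"  # placeholder handle

@op(required_resource_keys={"druid"})
def distinct_users(context) -> int:
    # Placeholder: run COUNT(DISTINCT userid) against Druid here.
    context.log.info(f"querying {context.resources.druid}")
    return 42

@job(resource_defs={"druid": druid_connection})
def metrics_pipeline():
    distinct_users()

# Run locally:
# metrics_pipeline.execute_in_process(
#     run_config={"resources": {"druid": {"config": {"host": "localhost", "port": 8082}}}}
# )
```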

The integration within the Analytics API happens over GraphQL, as Dagster has a GraphQL API built in. Dagster uses this interface to query all sorts of metadata, start pipelines/sensors (mutations), or subscribe to specific info. Side note: this does not come out of thin air, as Nick Schrock, the founder of Dagster, is a co-creator of GraphQL 😉. Instead of running and using the Dagster UI, we use that interface for developers and abstract it away behind the Analytics API.
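In practice, triggering a run through that interface is a plain HTTP POST against Dagster's GraphQL endpoint. A sketch with requests; the mutation shape follows Dagster's 0.x-era launchPipelineExecution API and may differ across versions, and the URL and repository names are hypothetical:

```python
# Sketch: launch a Dagster run through its GraphQL endpoint.
# Mutation shape follows Dagster's 0.x-era API and may differ per version;
# the URL, repository, and job names are hypothetical.
import requests

LAUNCH_MUTATION = """
mutation LaunchRun($params: ExecutionParams!) {
  launchPipelineExecution(executionParams: $params) {
    __typename
  }
}
"""

variables = {
    "params": {
        "selector": {
            "repositoryLocationName": "analytics_api",
            "repositoryName": "metrics_repo",
            "pipelineName": "metrics_pipeline",
        },
        "mode": "default",
    }
}

resp = requests.post(
    "http://localhost:3000/graphql",
    json={"query": LAUNCH_MUTATION, "variables": variables},
)
print(resp.json())
```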

See more on Data Orchestrators.

# SQL Connector

SQL is the data language besides Python, as elaborated in earlier articles. That’s why we need to provide an interface for it as well. The SQL Connector integrates all BI, SQL-speaking, or legacy tools. The connector mainly implements an ODBC or JDBC driver, for example with Avatica, which is built on Apache Calcite and used by Apache Druid. With that, we have a way to interface with ANSI SQL, including all our metrics and dimensions from the metrics store, with no additional effort on the accessing side as long as the tools speak SQL.
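From the consuming side this is invisible: any ODBC-capable tool or script simply runs SQL. A small sketch with pyodbc, where the DSN and the exposed schema and metric column are hypothetical:

```python
# Sketch: a SQL tool (or script) querying centralised metrics via ODBC.
# The DSN and the exposed schema/column names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=analytics_api")  # driver provided by the SQL Connector
cursor = conn.cursor()

# The metric is defined once in the metrics store; here it is just a column.
cursor.execute("SELECT country, distinct_users FROM metrics.customers")
for country, users in cursor.fetchall():
    print(country, users)
```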


Origin: Building an Analytics API with GraphQL: The Next Level of Data Engineering? | ssp.sh
Created 2022-02-19
