Second Brain: Crafted, Curated, Connected, Compounded on October 2
DevOps in Data Engineering

 


Is DevOps the new data engineering of data science? What do I mean? Just as in the old days, when you set out to do data science but spent 80% of your time on data engineering.

Nowadays, DevOps is often underrated and neglected in data engineering projects, yet most of the time, you end up spending a significant amount of time on it.

DevOps or GitOps, also known as Infrastructure as Code, is the practice of deploying your data platform in an iterative and code-first way. The data stack gets tested as part of a CI/CD pipeline, monitored through Prometheus events, and visualized via dashboards to track system health. All of these are considered standard practices, yet when you start implementing them, they consume the majority of your time.
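As a sketch of what "tested as part of a CI/CD pipeline" can mean in practice, here is a minimal smoke check a CI job could run against the stack's health metrics. The payload shape, field names, and thresholds are hypothetical; in a real setup they would come from Prometheus or a `/health` endpoint.

```python
# Minimal CI smoke check for a data stack (hypothetical fields/thresholds).

def evaluate_health(metrics: dict) -> list[str]:
    """Return a list of failed checks for a stack health payload."""
    failures = []
    if metrics.get("scheduler_up") is not True:
        failures.append("scheduler is down")
    if metrics.get("failed_pipelines", 0) > 0:
        failures.append(f"{metrics['failed_pipelines']} pipeline(s) failing")
    if metrics.get("storage_free_pct", 100) < 10:
        failures.append("object storage below 10% free")
    return failures

# In CI, this payload would be scraped from monitoring, not hardcoded.
sample = {"scheduler_up": True, "failed_pipelines": 0, "storage_free_pct": 42}
assert evaluate_health(sample) == []
```

A CI job would fail the build if the returned list is non-empty, which keeps a broken stack from ever reaching the deploy stage.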

I recall 10 years ago when we patiently waited for the data scientist to arrive and solve all our problems. Today, it’s almost the same, yet we wait for a DevOps person who knows how to deploy, manage, and release new versions in a non-disruptive way.

# Learnings

I spent some time thinking and conceptualizing this topic, and here are 8 points I learned:

1. Separation of concerns is crucial - keeping infrastructure, platform services, and business logic (pipelines) in distinct sections makes maintenance and collaboration easier.
2. Standardized deployment patterns like the base/overlay structure with Kustomize allow for reusable configurations across environments with minimal environment-specific overrides.
3. Versioned artifacts with timestamps (e.g., `finance-pipeline-20250512123045.tar.gz`) create a reliable release process that enables rollbacks and audit trails.
4. Database migration automation tools, such as Liquibase, can handle schema changes programmatically across environments, thereby reducing manual errors.
5. Test early, test often - validate data pipelines, infrastructure code, and database migrations separately before integration testing in an isolated environment.
6. Workspace separation from infrastructure code enables domain experts, such as data scientists and analysts, to focus on their core competencies while maintaining deployment standards.
7. CI visibility through lineage diagrams and documented processes enables teams to understand the deployment flow and troubleshoot issues more efficiently.
8. GitOps, as the single source of truth, means changes occur through Git commits, creating an automatic audit trail and enabling pull request reviews.

# After a While, it Stabilizes

Most of this effort is concentrated at the start. Once you have settled on a set of tools, you will have your deployment scripts, and DevOps is essentially finished, except for version upgrades, until you want to add new tools.

# Starting GitOps is Hard

A question on Bluesky:

> Selling GitOps to new data projects is hard if they haven’t already been burned. And cleaning up the mess in retrospect is difficult and thankless.

True, and that’s why it’s worth having someone, or a central team, that specializes in this and can complete the work in days rather than weeks, while data engineers and others focus on their core workload.

# Reference Example for Data Engineering

- Overall blueprint with best practices, more as an overview: kubernetes-gitops-deployment-blueprint. Specifically for data engineering, it is a reference architecture for Kubernetes-based data platforms using GitOps workflow patterns, and includes infrastructure configs, tenant isolation, database migrations, and observability templates for production deployments.
- CI/CD implementation with a GitOps infrastructure example using Flux CD, Kestra workflows, and Liquibase migrations, with a complete CI/CD pipeline implementation: gitops-flux-pipeline-showcase.
- Or my data-engineering-devops repo, which I used four years ago for my data engineering project: a full-stack data engineering tools and infrastructure setup with Druid, Kubernetes, MinIO (S3), notebooks, Spark, and Superset.
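For a flavor of what the automated database migrations in these examples look like, here is a small Liquibase changelog in YAML form. The table, column, and changeset names are hypothetical; the point is that the schema change and its rollback live in version control and run identically in every environment.

```yaml
# Hypothetical Liquibase changelog: add a currency column, with rollback.
databaseChangeLog:
  - changeSet:
      id: add-invoice-currency
      author: data-platform-team
      changes:
        - addColumn:
            tableName: invoices
            columns:
              - column:
                  name: currency
                  type: varchar(3)
                  defaultValue: "EUR"
      rollback:
        - dropColumn:
            tableName: invoices
            columnName: currency
```

Running `liquibase update` applies any changesets not yet recorded in the target database, which is what makes the same file safe to run across dev, staging, and production.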

# Infrastructure as Code

Infrastructure as Code (IaC) has evolved beyond bare provisioning to include policy as code, security as code, and compliance as code. Understanding tools like Terraform, Pulumi, Helm Charts, and Kubernetes, and how they integrate with specialized data infrastructure, is essential.
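Policy as code can be as simple as asserting invariants over resource definitions before they are applied. Real setups would use a dedicated engine such as Open Policy Agent or Sentinel; this is only a minimal sketch, and the resource shape and required tags are hypothetical.

```python
# Policy-as-code sketch: validate resource definitions before applying them.
# Hypothetical resource shape; real policies would run in OPA/Sentinel/etc.

REQUIRED_TAGS = {"owner", "cost-center"}

def violations(resource: dict) -> list[str]:
    """Return policy violations for a single resource definition."""
    problems = []
    missing = REQUIRED_TAGS - set(resource.get("tags", {}))
    if missing:
        problems.append(f"{resource['name']}: missing tags {sorted(missing)}")
    if resource.get("public", False):
        problems.append(f"{resource['name']}: must not be publicly accessible")
    return problems

bucket = {"name": "raw-zone", "tags": {"owner": "data-eng"}, "public": True}
assert violations(bucket) == [
    "raw-zone: missing tags ['cost-center']",
    "raw-zone: must not be publicly accessible",
]
```

Wired into CI, a check like this rejects a pull request before the offending infrastructure ever gets provisioned, which is the whole point of shifting compliance left.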

# Alternatives

So, with these downsides, what are the alternatives?

Suppose you choose a vendor or hyperscaler with a closed-source data platform that includes all your tools. You are then locked in and can’t extend features on top of closed-source tools. So it’s always a trade-off.

Also, after a while, the DevOps deployment stabilizes, and you need to invest less time.

# Further Reads

- Declarative Data Stack: the new shift to have end-to-end stacks defined in a single YAML file.
- DataOps combines DevOps practices with data analytics, focusing on improving the quality and reducing the cycle time of data analytics.
- Shift Left: implementing security controls earlier in the development lifecycle (“shifting left”) is becoming essential. This includes integrating security scanning into CI/CD pipelines, implementing data access governance as code, and employing techniques such as data masking in non-production environments.
- Developer Experience: GitOps can increase data teams’ productivity.
- Dagster git on Azure DevOps

Origin: GitOps
References: Codespaces, Devcontainers (devcontainer.json)
Created 2025-05-14
