Second Brain: Crafted, Curated, Connected, Compounded on October 2
DevOps in Data Engineering

 


Is DevOps the new data engineering of data science? What do I mean? Just as in the old days, when you set out to do data science but spent 80% of your time on data engineering.

Nowadays, DevOps is often underrated and neglected in data engineering projects, yet most of the time, you end up spending a significant amount of time on it.

DevOps or GitOps, also known as Infrastructure as Code, is the practice of deploying your data platform in an iterative and code-first way. The data stack gets tested as part of a CI/CD pipeline, monitored through Prometheus events, and visualized via dashboards to track system health. All of these are considered standard practices, yet when you start implementing them, they consume the majority of your time.
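As a sketch of what "tested as part of a CI/CD pipeline" can mean in practice, here is a minimal smoke check a CI job could run against the stack's health metrics. The payload shape, field names, and thresholds are hypothetical; in a real setup they would come from Prometheus or a `/health` endpoint.

```python
# Minimal CI smoke check for a data stack (hypothetical fields/thresholds).

def evaluate_health(metrics: dict) -> list[str]:
    """Return a list of failed checks for a stack health payload."""
    failures = []
    if metrics.get("scheduler_up") is not True:
        failures.append("scheduler is down")
    if metrics.get("failed_pipelines", 0) > 0:
        failures.append(f"{metrics['failed_pipelines']} pipeline(s) failing")
    if metrics.get("storage_free_pct", 100) < 10:
        failures.append("object storage below 10% free")
    return failures

# In CI, this payload would be scraped from monitoring, not hardcoded.
sample = {"scheduler_up": True, "failed_pipelines": 0, "storage_free_pct": 42}
assert evaluate_health(sample) == []
```

A CI job would fail the build if the returned list is non-empty, which keeps a broken stack from ever reaching the deploy stage.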

I recall 10 years ago when we patiently waited for the data scientist to arrive and solve all our problems. Today, it’s almost the same, yet we wait for a DevOps person who knows how to deploy, manage, and release new versions in a non-disruptive way.

# Learnings

I spent some time thinking and conceptualizing this topic, and here are 8 points I learned:

1. Separation of concerns is crucial - keeping infrastructure, platform services, and business logic (pipelines) in distinct sections makes maintenance and collaboration easier.
2. Standardized deployment patterns like the base/overlay structure with Kustomize allow for reusable configurations across environments with minimal environment-specific overrides.
3. Versioned artifacts with timestamps (e.g., `finance-pipeline-20250512123045.tar.gz`) create a reliable release process that enables rollbacks and audit trails.
4. Database migration automation tools, such as Liquibase, can handle schema changes programmatically across environments, thereby reducing manual errors.
5. Test early, test often - validate data pipelines, infrastructure code, and database migrations separately before integration testing in an isolated environment.
6. Workspace separation from infrastructure code enables domain experts, such as data scientists and analysts, to focus on their core competencies while maintaining deployment standards.
7. CI visibility through lineage diagrams and documented processes enables teams to understand the deployment flow and troubleshoot issues more efficiently.
8. GitOps, as the single source of truth, means changes occur through Git commits, creating an automatic audit trail and enabling pull request reviews.

# After a While, it Stabilizes

Most of this effort is concentrated at the start. Once you have settled on a set of tools, you will have your deployment scripts, and DevOps is essentially finished, except for version upgrades, until you want to add new tools.

# Starting GitOps is Hard

A question on Bluesky:

> Selling GitOps to new data projects is hard if they haven’t already been burned. And cleaning up the mess in retrospect is difficult and thankless.

True, and that’s why it’s worth having someone, or a central team, that specializes in this and can complete the work in days rather than weeks, while data engineers and others focus on their core workload.

# Reference Example for Data Engineering

- Overall blueprint with best practices, more as an overview: kubernetes-gitops-deployment-blueprint. Specifically for data engineering, it is a reference architecture for Kubernetes-based data platforms using GitOps workflow patterns, and includes infrastructure configs, tenant isolation, database migrations, and observability templates for production deployments.
- CI/CD implementation with a GitOps infrastructure example using Flux CD, Kestra workflows, and Liquibase migrations, with a complete CI/CD pipeline implementation: gitops-flux-pipeline-showcase.
- Or my data-engineering-devops repo, which I used four years ago for my data engineering project: a full-stack data engineering tools and infrastructure setup with Druid, Kubernetes, MinIO (S3), notebooks, Spark, and Superset.
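For a flavor of what the automated database migrations in these examples look like, here is a small Liquibase changelog in YAML form. The table, column, and changeset names are hypothetical; the point is that the schema change and its rollback live in version control and run identically in every environment.

```yaml
# Hypothetical Liquibase changelog: add a currency column, with rollback.
databaseChangeLog:
  - changeSet:
      id: add-invoice-currency
      author: data-platform-team
      changes:
        - addColumn:
            tableName: invoices
            columns:
              - column:
                  name: currency
                  type: varchar(3)
                  defaultValue: "EUR"
      rollback:
        - dropColumn:
            tableName: invoices
            columnName: currency
```

Running `liquibase update` applies any changesets not yet recorded in the target database, which is what makes the same file safe to run across dev, staging, and production.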

# Infrastructure as Code

Infrastructure as Code (IaC) has evolved beyond bare provisioning to include policy as code, security as code, and compliance as code. Understanding tools like Terraform, Pulumi, Helm Charts, and Kubernetes, and how they integrate with specialized data infrastructure, is essential.
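Policy as code can be as simple as asserting invariants over resource definitions before they are applied. Real setups would use a dedicated engine such as Open Policy Agent or Sentinel; this is only a minimal sketch, and the resource shape and required tags are hypothetical.

```python
# Policy-as-code sketch: validate resource definitions before applying them.
# Hypothetical resource shape; real policies would run in OPA/Sentinel/etc.

REQUIRED_TAGS = {"owner", "cost-center"}

def violations(resource: dict) -> list[str]:
    """Return policy violations for a single resource definition."""
    problems = []
    missing = REQUIRED_TAGS - set(resource.get("tags", {}))
    if missing:
        problems.append(f"{resource['name']}: missing tags {sorted(missing)}")
    if resource.get("public", False):
        problems.append(f"{resource['name']}: must not be publicly accessible")
    return problems

bucket = {"name": "raw-zone", "tags": {"owner": "data-eng"}, "public": True}
assert violations(bucket) == [
    "raw-zone: missing tags ['cost-center']",
    "raw-zone: must not be publicly accessible",
]
```

Wired into CI, a check like this rejects a pull request before the offending infrastructure ever gets provisioned, which is the whole point of shifting compliance left.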

# Alternatives

So, with these downsides, what are the alternatives?

Suppose you choose a vendor or hyperscaler with a closed-source data platform that includes all your tools. You are then locked in and can’t extend features on top of closed-source tools. So it’s always a trade-off.

Also, after a while, the DevOps deployment stabilizes, and you need to invest less time.

# Further Reads

- Declarative Data Stack: the new shift to have end-to-end stacks defined in a single YAML file.
- DataOps combines DevOps practices with data analytics, focusing on improving the quality and reducing the cycle time of data analytics.
- Shift Left: implementing security controls earlier in the development lifecycle (“shifting left”) is becoming essential. This includes integrating security scanning into CI/CD pipelines, implementing data access governance as code, and employing techniques such as data masking in non-production environments.
- Developer Experience: GitOps can increase data teams’ productivity.
- Dagster git on Azure DevOps

Origin: GitOps
References: Codespaces, Devcontainers (devcontainer.json)
Created 2025-05-14
