ByteByteGo, October 1, 23:44
How Airbnb Achieves a Highly Available Distributed Database on Kubernetes

 

Airbnb's engineering team faced the challenge of running a distributed database reliably on Kubernetes. They took an unusual approach: deploying database clusters across multiple AWS Availability Zones to improve reliability and fault tolerance. This article details how they use a custom Kubernetes Operator, AWS EBS volumes, and PersistentVolumeClaims to manage node replacements safely, and how layered safety measures and a multi-cluster deployment strategy contain the risks of Kubernetes upgrades. It also covers how replica reads and stale reads mitigate EBS latency, resulting in a highly available, low-latency, and cost-efficient database system that handles millions of queries per second.

💾 **Deploying database clusters across multiple Availability Zones:** Airbnb deploys each distributed database across several independent Kubernetes clusters, each mapped to a different AWS Availability Zone (AZ). This design isolates failures, ensures that the loss of one AZ does not take down the whole database, significantly shrinks the "blast radius" of a bad deployment, and improves overall resilience.

⚙️ **Managing node replacements with a custom Kubernetes Operator:** Because databases are stateful, Airbnb built a custom Kubernetes Operator that coordinates the node replacement process. It checks node health, uses admission hooks to intercept and control infrastructure-driven evictions, and pauses other replacements when an unplanned failure occurs, so the cluster maintains quorum and stays stable even under frequent node churn.

💡 **Mitigating storage latency with EBS volumes and read strategies:** To address the latency of AWS EBS storage, Airbnb combines several measures: EBS volumes allow fast volume reattachment, and storage reads are protected with timeouts and retries. They also tuned read strategies, allowing reads from replicas and enabling stale reads for use cases that do not need the freshest data, improving responsiveness and availability without giving up consistency where it matters.

🚀 **A staged Kubernetes upgrade strategy:** To reduce the risk of Kubernetes upgrades, Airbnb runs self-managed Kubernetes clusters and upgrades them in stages, one cluster at a time, starting with smaller or less critical ones. Because each database spans multiple clusters, the remaining clusters keep serving traffic even if one becomes briefly unstable during an upgrade, preserving overall availability.

Stop Agent Hallucinations with Project Rules (Warp University) (Sponsored)

AI coding agents are here, but harnessing that power without agents going off the rails isn’t always simple.

Warp’s Project Rules solves that. Warp is the top-ranked coding agent (outperforming Claude Code, Cursor, and Codex on benchmarks) and is trusted by 700K+ developers.

In this Warp University lesson, you’ll learn how to use Project Rules to give agents reliable context, so they stop hallucinating and start shipping.

WATCH NOW


Disclaimer: The details in this post have been derived from the official documentation shared online by the Airbnb Engineering Team. All credit for the technical details goes to the Airbnb Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

For years, companies relied on large, expensive standalone servers to run their databases. As traffic increased, the standard approach was to implement sharding, which involved dividing the data into smaller pieces and distributing them across multiple machines. While this worked for a time, it became a heavy burden. Maintaining shards, handling upgrades, and keeping the whole system flexible turned into a complex and expensive problem.

The last decade has seen the rise of distributed, horizontally scalable open-source SQL databases. These systems allow organizations to spread data across many smaller machines rather than relying on a single giant one.

However, there’s a catch: running such databases reliably in the cloud is far from simple. It’s not just about spinning up more servers. You need to ensure strong consistency, high availability, and low latency, all without increasing costs. This balance has proven tricky for even the most advanced engineering teams.

This is where Airbnb’s engineering team took an unusual path.

Instead of limiting a database cluster to one Kubernetes environment, they chose to deploy distributed database clusters across multiple Kubernetes clusters, each one mapped to a different AWS Availability Zone.

This is not a common design pattern. Most companies avoid it because of the added complexity. But Airbnb’s engineers saw it as the best way to ensure reliability, reduce the impact of failures, and keep operations smooth.

In this article, we will look at how Airbnb implemented this design and the challenges they faced.


Help us Make ByteByteGo Newsletter Better

TL;DR: Take this 2-minute survey so I can learn more about who you are, what you do, and how I can improve ByteByteGo.

Take the ByteByteGo Survey


Running Databases on Kubernetes

Kubernetes is very good at running stateless workloads. A workload is called stateless when it does not need to remember anything between requests. For example, a web server that simply returns HTML pages is stateless. If one server goes down, another can immediately take over, because there is no important memory or data tied to a single machine.


Databases, however, are stateful. They must keep track of data, store it reliably, and make sure that changes are not lost.

Running stateful systems on Kubernetes is harder because when a database node is replaced or restarted, the system must ensure the stored data is not corrupted or lost.

One of the biggest risks comes from node replacement. In a distributed database, data is stored across several nodes. To guarantee correctness, the cluster depends on a quorum. A quorum means that a majority of nodes (for example, 2 out of 3 or 3 out of 5) must agree on the current state of data. If too many nodes fail or get replaced at the wrong time, the quorum is lost and the database can stop serving requests. Kubernetes, by default, does not understand how the data is spread across nodes, so it cannot prevent dangerous replacements.
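To make the quorum constraint concrete, here is a minimal Go sketch (not Airbnb's actual code) of the check a replacement process would need to run before taking another node offline:

```go
package main

import "fmt"

// hasQuorum reports whether a majority of replicas would still be healthy.
func hasQuorum(healthy, total int) bool {
	return healthy > total/2
}

// canReplace checks whether taking one more node offline keeps the cluster
// above quorum. Kubernetes has no built-in notion of this check, which is
// why a custom operator is needed to gate replacements.
func canReplace(healthyNow, total int) bool {
	return hasQuorum(healthyNow-1, total)
}

func main() {
	fmt.Println(canReplace(3, 3)) // true: 2 of 3 nodes still agree
	fmt.Println(canReplace(2, 3)) // false: losing another node breaks quorum
}
```

Kubernetes has no built-in equivalent of this check, which is exactly the gap Airbnb's custom tooling fills.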

The Airbnb engineering team solved this problem with a few smart techniques:

Durable storage per node: each database node keeps its data on an AWS EBS volume instead of on the local disk of whatever machine it happens to run on.

Persistent Volume Claims (PVCs): pods attach to their EBS volumes through PVCs, so a replacement pod picks up exactly the same data as the pod it replaces.

A custom Kubernetes Operator: the operator coordinates node replacements so that the cluster never loses quorum.

A PersistentVolumeClaim (PVC) is Kubernetes's way of binding a pod to a specific piece of durable storage. When the pod is rescheduled onto another machine, the claim (and the EBS volume behind it) follows the pod, so the data survives the move.

By combining EBS volumes, PVCs, and a custom operator, Airbnb has found a way to safely run a distributed database on Kubernetes. This allows them to benefit from Kubernetes automation without sacrificing the consistency or availability that a database requires.

Node Replacement Coordination

In any large system, database nodes will eventually need to be replaced. This can happen for several reasons, and Airbnb groups them into three categories:

Planned replacements, initiated deliberately, for example to rotate instances or change machine types.

Infrastructure-driven evictions, where Kubernetes or the underlying infrastructure asks to drain a node, for example for maintenance.

Unplanned failures, where a node disappears without warning because of a hardware or software fault.

Each of these situations requires a different handling strategy.

Here’s how Airbnb handles them:

For planned replacements, the custom operator checks cluster health first and performs replacements one at a time, so quorum is never put at risk.

For infrastructure-driven evictions, admission hooks intercept the eviction request and only allow it through once the operator confirms it is safe.

For unplanned failures, the operator pauses all other replacement and maintenance work until the failed node has been handled and the cluster is healthy again.

In short, Airbnb built a layered safety net. Their custom operator works with admission hooks to serialize planned replacements, while also pausing other maintenance when an unplanned failure is already in progress. This careful coordination is what keeps the database cluster available even in the face of node churn.
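As an illustration of that layered safety net, the sketch below (plain Go, not Airbnb's implementation) shows the shape of the decision an admission hook has to make. A real ValidatingAdmissionWebhook would receive an AdmissionReview object over TLS; only the gating logic is shown here.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
)

// clusterState tracks what the operator knows about the database cluster.
// The fields and their sources are illustrative assumptions.
type clusterState struct {
	mu               sync.Mutex
	totalNodes       int
	healthyNodes     int
	unplannedFailure bool // true while an unexpected node loss is being handled
	replacementBusy  bool // true while another planned replacement is in flight
}

func (s *clusterState) allowEviction() (bool, string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	switch {
	case s.unplannedFailure:
		return false, "unplanned failure in progress; planned maintenance is paused"
	case s.replacementBusy:
		return false, "another replacement is in flight; evictions are serialized"
	case s.healthyNodes-1 <= s.totalNodes/2:
		return false, "eviction would break quorum"
	default:
		s.replacementBusy = true // reset when the replacement completes (omitted here)
		return true, "eviction allowed"
	}
}

func main() {
	state := &clusterState{totalNodes: 3, healthyNodes: 3}

	http.HandleFunc("/evict", func(w http.ResponseWriter, r *http.Request) {
		allowed, reason := state.allowEviction()
		json.NewEncoder(w).Encode(map[string]any{"allowed": allowed, "reason": reason})
	})
	log.Fatal(http.ListenAndServe(":8443", nil))
}
```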

Kubernetes Upgrades

Keeping Kubernetes up to date is important for security, stability, and access to new features. However, upgrades are particularly risky when running databases. Unlike stateless services, a database cannot simply restart without consequences, because it needs to protect the integrity and availability of stored data.

One of the main problems is with cloud-managed Kubernetes services. In these systems, when the control plane (the central component that manages all the worker nodes and scheduling) is upgraded, there is usually no way to roll it back if something goes wrong. This means a bad upgrade can lead to serious disruption, and the only option is to wait while the provider fixes it.

To reduce this risk, the Airbnb engineering team uses self-managed Kubernetes clusters. In a self-managed setup, engineers have full control over the cluster and can perform a rollback if an upgrade causes issues. This is safer, but it does not completely remove the risk. If a rollback is needed, the database could still experience downtime or instability until the rollback is finished.

Airbnb’s strategy is to rely on careful sequencing of upgrades and its multi-cluster deployment design. Instead of upgrading everything at once, they upgrade one cluster at a time, starting with a smaller or less critical cluster. Since each database spans multiple clusters across different availability zones, even if one cluster is temporarily unstable during an upgrade, the others continue serving traffic. This greatly reduces the blast radius of a bad upgrade and helps ensure that the overall system remains available.
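Here is a hedged Go sketch of that sequencing, with upgradeCluster and clusterHealthy as stand-ins for the real upgrade and health-probe steps; the cluster names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Placeholders for the real operations: driving a self-managed control-plane
// upgrade and probing the database afterwards.
func upgradeCluster(name string) error { fmt.Println("upgrading", name); return nil }
func clusterHealthy(name string) bool  { return true }

// stagedUpgrade upgrades one Kubernetes cluster at a time, starting with the
// least critical one, and halts as soon as a cluster fails to come back
// healthy. Because the database spans all clusters, the untouched clusters
// keep serving traffic while each upgrade is in progress.
func stagedUpgrade(clustersByRisk []string, soak time.Duration) error {
	for _, cluster := range clustersByRisk {
		if err := upgradeCluster(cluster); err != nil {
			return fmt.Errorf("upgrade of %s failed: %w", cluster, err)
		}
		time.Sleep(soak) // let the upgraded cluster soak before touching the next one
		if !clusterHealthy(cluster) {
			return errors.New(cluster + " unhealthy after upgrade; halting rollout")
		}
	}
	return nil
}

func main() {
	// Hypothetical cluster names, ordered from least to most critical.
	_ = stagedUpgrade([]string{"staging-az1", "prod-az1", "prod-az2", "prod-az3"}, time.Minute)
}
```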

Multi-Cluster Deployment for Fault Tolerance

One of the most important choices Airbnb made was to deploy each database across three separate Kubernetes clusters, with each cluster running in a different AWS Availability Zone (AZ).

An Availability Zone is like an independent data center within a region, with its own power, networking, and hardware. By spreading clusters across three AZs, Airbnb ensures that the failure of one zone does not take down the entire database.
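The sketch below (illustrative Go, with made-up zone and cluster names) captures the placement rule: one cluster per Availability Zone, checked so that the loss of any single zone still leaves a majority of clusters running.

```go
package main

import "fmt"

// assignClusters maps one Kubernetes cluster to each Availability Zone so
// that replicas never share a failure domain. Names are illustrative.
func assignClusters(zones []string) map[string]string {
	placement := make(map[string]string)
	for i, zone := range zones {
		placement[fmt.Sprintf("db-cluster-%d", i+1)] = zone
	}
	return placement
}

// survivesZoneLoss checks that losing any single zone still leaves a
// majority of clusters (and therefore of replicas) available.
func survivesZoneLoss(placement map[string]string) bool {
	total := len(placement)
	zoneCount := make(map[string]int)
	for _, zone := range placement {
		zoneCount[zone]++
	}
	for _, inZone := range zoneCount {
		if total-inZone <= total/2 {
			return false // losing this zone would drop below a majority
		}
	}
	return true
}

func main() {
	placement := assignClusters([]string{"us-east-1a", "us-east-1b", "us-east-1c"})
	fmt.Println(placement, survivesZoneLoss(placement)) // true: any single AZ can fail
}
```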


This setup brings several benefits:

Fault isolation: a failure in one Kubernetes cluster or Availability Zone stays contained instead of spreading to the whole database.

Smaller blast radius: deployments, upgrades, and configuration changes touch one cluster at a time, so a mistake affects only a fraction of the capacity.

Continuous availability: while one cluster is degraded or being worked on, the other clusters keep serving traffic.

By combining these practices, Airbnb created a database architecture that tolerates failures gracefully. During one incident, a faulty configuration abruptly terminated all virtual machines of a specific type in Airbnb’s staging environment. This wiped out most of the query layer pods in that cluster. However, because the database was running in three clusters, the other two clusters continued operating normally. This prevented the problem from affecting users.

Leveraging AWS EBS

For storage, Airbnb chose to use Amazon Elastic Block Store (EBS). EBS is a block-level storage service that provides highly durable and reliable disks for cloud workloads. It plays a key role in making Airbnb’s database clusters both resilient and efficient.

EBS helps in the following manner:

Fast reattachment during node replacement: If a node goes down, its EBS volume can be detached and quickly reattached to a new machine. This makes recovery much faster compared to copying data onto a fresh disk.

Higher durability compared to local disks: Local disks are tied to the machine they run on. If the machine fails, data can be lost. EBS, on the other hand, is built to survive machine failures, which provides much stronger guarantees for storing critical database information.

Reliable clusters with fewer replicas: Thanks to the durability of EBS, Airbnb can confidently run its database with just three replicas of the data. Without EBS-level durability, they might need more replicas to guard against data loss, which would increase costs and complexity.
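As a rough sketch of the fast-reattachment idea, here is what moving a volume looks like with the AWS SDK for Go v2. In Airbnb's setup this step is driven by Kubernetes PVCs and the EBS CSI driver rather than direct SDK calls, and the volume ID, instance ID, and device name below are placeholders.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// reattachVolume moves an existing EBS volume to a replacement instance.
// Because the data lives on the volume, the new node starts with the old
// node's data instead of copying it onto a fresh disk.
func reattachVolume(ctx context.Context, client *ec2.Client, volumeID, newInstanceID string) error {
	if _, err := client.DetachVolume(ctx, &ec2.DetachVolumeInput{
		VolumeId: aws.String(volumeID),
	}); err != nil {
		return err
	}
	// A production version would wait for the volume to become "available"
	// before attaching it to the new instance.
	_, err := client.AttachVolume(ctx, &ec2.AttachVolumeInput{
		VolumeId:   aws.String(volumeID),
		InstanceId: aws.String(newInstanceID),
		Device:     aws.String("/dev/xvdf"),
	})
	return err
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)
	if err := reattachVolume(context.Background(), client, "vol-0123456789abcdef0", "i-0123456789abcdef0"); err != nil {
		log.Fatal(err)
	}
}
```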

The Latency Challenge

Despite its durability, EBS is not perfect. At times, it experiences tail latency spikes, where a small fraction of requests take much longer than normal.

In Airbnb’s case, the p99 latency (the latency below which 99 percent of requests complete) sometimes reached a full second. For a high-throughput database, even rare one-second delays can cause problems.

Some solutions Airbnb implemented are as follows:

Read timeouts and retries: storage reads are bounded by a timeout, and slow requests are retried rather than left to wait out a latency spike.

Replica reads: reads can be served from replica nodes instead of always going to the leader, spreading load and avoiding a single slow disk.

Stale reads: for use cases that do not need the very latest data, slightly stale reads are allowed, trading a little freshness for much better latency and availability.
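A minimal Go sketch of how those read-side mitigations fit together; the readRow type, the timeouts, and the fallback order are assumptions for illustration, not Airbnb's client API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// readRow stands in for a query against the database; the real client API
// depends on the specific database Airbnb runs.
type readRow func(ctx context.Context) ([]byte, error)

// readWithFallback bounds each attempt with a short timeout so a storage
// latency spike cannot stall the request, retries the primary read, and then
// falls back to a replica (possibly stale) read for callers that can accept
// slightly old data.
func readWithFallback(ctx context.Context, primary, replica readRow, attempts int, perTry time.Duration) ([]byte, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		tryCtx, cancel := context.WithTimeout(ctx, perTry)
		data, err := primary(tryCtx)
		cancel()
		if err == nil {
			return data, nil
		}
		lastErr = err
	}
	// Stale read: only acceptable when the caller does not need the latest write.
	if data, err := replica(ctx); err == nil {
		return data, nil
	}
	return nil, fmt.Errorf("all read attempts failed: %w", lastErr)
}

func main() {
	primary := func(ctx context.Context) ([]byte, error) { return []byte("fresh value"), nil }
	replica := func(ctx context.Context) ([]byte, error) { return []byte("stale value"), nil }
	data, err := readWithFallback(context.Background(), primary, replica, 2, 50*time.Millisecond)
	fmt.Println(string(data), err)
}
```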

By combining these techniques, Airbnb reduced the impact of EBS latency spikes while maintaining the reliability benefits of EBS. The result is a system that can handle millions of queries per second without sacrificing consistency or performance for critical workloads.

Conclusion

Airbnb’s engineering team set out to solve one of the hardest challenges in modern infrastructure: running a distributed database reliably on Kubernetes.

By carefully combining Kubernetes operators, multi-cluster deployments, AWS EBS, and advanced read strategies like replica reads and stale reads, they were able to build a system that is both highly available and cost-efficient.

This design is not just theory. It is already running in production: Airbnb operates several such clusters today, together serving millions of queries per second.

With this setup, Airbnb has achieved the goals they were aiming for: high availability, low latency, scalability, and reduced maintenance costs. The system can survive node replacements, infrastructure failures, and even the loss of an entire availability zone without bringing down the service.

The key takeaway is that open-source distributed databases can succeed on Kubernetes when combined with thoughtful engineering. Airbnb’s work shows that even though running stateful systems in Kubernetes is difficult, the benefits in automation, resilience, and cost efficiency make it worth the effort.

For other engineering teams, Airbnb’s approach provides a roadmap: start small, use operators for safety, design for fault isolation across clusters, and build strategies to handle edge cases such as storage latency. With these practices, high availability in the cloud becomes not just possible but sustainable at scale.

References:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing sponsorship@bytebytego.com.
