ByteByteGo, August 12
How Atlassian Migrated 4 Million Jira Databases to AWS Aurora

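This article details how Atlassian migrated 4 million PostgreSQL databases to AWS Aurora PostgreSQL. Faced with an enormous number of databases and a requirement to minimize user impact, Atlassian adopted a "draining" strategy: it gradually moved databases onto temporary Aurora instances, sidestepping the startup timeout that Aurora hits when a cluster holds too many files. The process kept downtime under 3 minutes per tenant and, through smaller instance sizes and auto-scaling, improved both cost efficiency and reliability.

💡 **Migration goals and challenges**: Atlassian set out to move 4 million Jira PostgreSQL databases, spread across 13 AWS regions, to AWS Aurora PostgreSQL in order to improve cost efficiency, reliability, and performance. The main challenge was managing a migration of this size while preserving the user experience (under 3 minutes of downtime) and controlling infrastructure costs.

🚀 **Core migration strategy: read replicas and coordinated cutover**: Atlassian used a "conversion" strategy, first creating an Aurora read replica for each RDS PostgreSQL instance to synchronize the data. The key step was the coordinated cutover: after locking writes on the source database, the read replica was promoted to a standalone Aurora cluster, application connections were quickly repointed to the new cluster, and users were then unlocked. AWS Step Functions orchestrated the process and kept it safe.

⚠️ **File count limit and the "draining" solution**: Testing revealed that Aurora cluster startup timed out because each Jira database contains a large number of objects (about 5,000 files), which adds up across the thousands of tenants on one instance. To solve this, Atlassian developed the "draining" strategy: move tenant databases off an RDS instance until its file count falls below the threshold Aurora can handle, then run the standard conversion. Once converted, the new Aurora cluster can serve as a "refill" target for other instances, forming a "daisy chain".

💰 **Aurora advantages and cost efficiency**: Unlike RDS, Aurora lets both the writer and the reader instance serve traffic, which allowed Atlassian to use smaller instances (r6.2xlarge instead of m5.4xlarge) and reduce cost. Aurora's auto-scaling (up to 15 reader instances) and higher SLA (99.99%) also improved performance and reliability. Although the number of instances increased, overall cost efficiency improved.

📈 **Results and scale**: The migration converted 2,403 RDS instances, migrated 4.3 million databases (including draining operations), and handled 27.4 billion files. The final Aurora footprint has twice as many instances, but dynamic scaling sharply reduces cost during off-peak hours. The project met Atlassian's goals and may have prompted improvements to the Aurora platform itself.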


Disclaimer: The details in this post have been derived from the official documentation shared online by the Atlassian Engineering Team. All credit for the technical details goes to the Atlassian Engineering Team.  The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Most companies struggle to migrate a handful of databases without causing downtime. Atlassian routinely migrates about 1,000 databases every day as part of its regular operations, and its users never notice. 

However, in 2024, they tackled something far more ambitious: migrating 4 million databases with minimal user impact.

The numbers alone are staggering. Atlassian's Jira platform uses a one-database-per-tenant approach, spreading 4 million PostgreSQL databases across 3,000 server instances in 13 AWS regions worldwide. Each database contains everything for a Jira tenant: issues, projects, workflows, and custom fields.

Their goal was to move everything from AWS RDS PostgreSQL to AWS Aurora PostgreSQL. The expected benefits were compelling: lower cost, higher reliability, and better performance.

The constraints were equally challenging. They needed to keep downtime under 3 minutes per tenant, control infrastructure costs during the migration, and complete the entire project within months rather than years.

In this article, we will look at how Atlassian carried out this migration and the challenges they faced.

Migration Strategy

Atlassian's team chose what seemed like a textbook approach for migrating from RDS to Aurora. This process is also known as “conversion”.

See the diagram below:

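Here's how the process was designed to work: create an Aurora read replica of the RDS PostgreSQL instance and let it synchronize, lock writes on the source database, promote the replica to a standalone Aurora cluster, repoint application connections to the new cluster, and then unlock users. AWS Step Functions orchestrated the sequence.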

See the diagram below for a detailed look at the conversion process.

Lastly, they also used feature flags that let them instantly redirect database connections. Instead of waiting for applications to slowly discover the new database endpoints (which could take time), feature flags allowed them to override the tenant database endpoints immediately across all application servers.
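The override idea can be illustrated with a minimal sketch. The class, the flag store, and the hostnames below are hypothetical and are not taken from Atlassian's code; the sketch only assumes that a per-tenant flag, when set, takes precedence over the endpoint the application would otherwise discover.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class EndpointResolver:
    """Resolve a tenant's database endpoint, letting a flag override discovery."""

    # Endpoints the application would find through its normal discovery path.
    discovered: Dict[str, str]
    # Hypothetical feature-flag store: tenant id -> forced endpoint.
    overrides: Dict[str, str] = field(default_factory=dict)

    def set_override(self, tenant_id: str, endpoint: str) -> None:
        self.overrides[tenant_id] = endpoint

    def clear_override(self, tenant_id: str) -> None:
        self.overrides.pop(tenant_id, None)

    def resolve(self, tenant_id: str) -> str:
        # The flag wins whenever it is set; otherwise fall back to discovery.
        if tenant_id in self.overrides:
            return self.overrides[tenant_id]
        return self.discovered[tenant_id]


resolver = EndpointResolver(discovered={"tenant-42": "rds-001.example.internal"})
before = resolver.resolve("tenant-42")
resolver.set_override("tenant-42", "aurora-001.example.internal")
after = resolver.resolve("tenant-42")
```

Because the override is checked on every lookup, flipping the flag takes effect on the next connection attempt without waiting for the discovery path to catch up.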

The most impressive part was that they managed to keep the actual cutover time to less than 3 minutes, even for their largest instances. This was possible because the heavy lifting (data synchronization) happened ahead of time through the read replica, and the feature flags enabled instant endpoint switching.
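The ordering of the cutover can be sketched as follows. This is an illustration only: the four callables stand in for the real operations (locking writes, promoting the replica, switching endpoints, unlocking), the function name is hypothetical, and the unconditional unlock on failure is a simplifying assumption. Atlassian's actual orchestration ran in AWS Step Functions, and its rollback behavior is not described here.

```python
import time
from typing import Callable, List

CUTOVER_BUDGET_SECONDS = 180  # the "under 3 minutes" target from the article


def coordinated_cutover(
    lock_writes: Callable[[], None],
    promote_replica: Callable[[], None],
    switch_endpoints: Callable[[], None],
    unlock_writes: Callable[[], None],
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Run the cutover steps in order and return how long tenants were locked."""
    started = clock()
    lock_writes()
    try:
        promote_replica()
        switch_endpoints()
    finally:
        # Simplification: always unlock so tenants are never left locked.
        unlock_writes()
    return clock() - started


# Stand-in steps that only record the order in which they ran.
events: List[str] = []
locked_for = coordinated_cutover(
    lock_writes=lambda: events.append("lock"),
    promote_replica=lambda: events.append("promote"),
    switch_endpoints=lambda: events.append("switch"),
    unlock_writes=lambda: events.append("unlock"),
)
within_budget = locked_for < CUTOVER_BUDGET_SECONDS
```

Only the promote and switch steps sit inside the locked window, which is why doing the data synchronization beforehand keeps that window short.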

The Advantages of Aurora

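The key advantages of using Aurora were as follows: both the writer and the reader instance can serve traffic, which allowed smaller instances (r6.2xlarge instead of m5.4xlarge); reader capacity can auto-scale up to 15 instances; and the SLA is higher, at 99.99%.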

The File Count Limit Problem

Just when everything seemed to be going smoothly, Atlassian hit a wall that nobody saw coming. 

During their testing phase, AWS support contacted them with concerning news: one of their large test RDS instances had successfully synchronized all its data to Aurora, but the new Aurora cluster had failed to start up properly.

From Atlassian's perspective, everything looked fine in the AWS console. The replica appeared healthy and still replicating. However, AWS's internal monitoring had detected that the Aurora instance's startup process had timed out and failed, even though the surface indicators suggested everything was working normally.

To understand what went wrong, we need to know how PostgreSQL stores data on disk. In PostgreSQL, every high-level database object, such as a table, index, or sequence, is stored as at least one file on the server's disk.

Jira has a particularly complex database schema because each tenant database must handle issues, projects, workflows, custom fields, and more.

All this complexity means that a single Jira database needs approximately 5,000 files on disk to store all its tables, indexes, and sequences. At 4,000 tenants per cluster, that comes to 20 million files per Aurora cluster.
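The arithmetic is simple enough to check directly. The snippet below uses only the two approximate figures quoted in the article; the function and constant names are illustrative.

```python
FILES_PER_JIRA_DATABASE = 5_000  # approximate figure from the article
TENANTS_PER_CLUSTER = 4_000      # tenant density from the article


def files_per_cluster(tenants: int, files_per_db: int = FILES_PER_JIRA_DATABASE) -> int:
    """Estimate the files in a cluster volume from its tenant count."""
    return tenants * files_per_db


total_files = files_per_cluster(TENANTS_PER_CLUSTER)
print(f"{total_files:,} files")  # prints "20,000,000 files"
```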

When an Aurora instance starts up, it performs various status checks to ensure everything is healthy. One of these checks involves enumerating (counting and cataloging) all the files in the cluster volume. This process is normally quick and invisible to users. 

However, when there are 20+ million files, this enumeration process takes a very long time. Aurora has internal timeout thresholds for how long startup processes are allowed to run. With Atlassian's massive file counts, the enumeration was taking longer than Aurora's startup timeout threshold, causing the instance to fail during boot-up.

To handle this problem, the team had only two options to reduce file counts:

The solution they developed would become known as "draining".

The Draining Solution

Faced with the file count limitation, Atlassian developed an elegant solution that turned the problem into an opportunity. Instead of fighting the constraints, they embraced a new workflow that would improve their infrastructure efficiency.

See the diagram below:

Step 1: Drain

"Draining" means moving tenant databases off an RDS instance until it reaches a safe file count threshold that Aurora can handle during startup. Instead of 4,000 tenants per instance (20 million files), they needed to reduce this to a much smaller number.
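A drain plan along these lines can be sketched in a few lines. The safe file limit, the tenant records, and the function name are all hypothetical, since the article does not state the actual threshold. The sketch drains the smallest tenants first, which is one reasonable ordering, and stops as soon as the instance is at or under the limit.

```python
from typing import List, NamedTuple, Tuple


class Tenant(NamedTuple):
    tenant_id: str
    size_mb: int     # data size, used only to order the drain
    file_count: int  # files this tenant's database occupies on disk


def plan_drain(tenants: List[Tenant], safe_file_limit: int) -> Tuple[List[str], int]:
    """Choose tenants to move until the instance is at or under the file limit."""
    remaining_files = sum(t.file_count for t in tenants)
    to_move: List[str] = []
    # Smallest tenants first: they are the cheapest to relocate.
    for tenant in sorted(tenants, key=lambda t: t.size_mb):
        if remaining_files <= safe_file_limit:
            break
        to_move.append(tenant.tenant_id)
        remaining_files -= tenant.file_count
    return to_move, remaining_files


# Made-up tenants and limit, for illustration only.
instance = [
    Tenant("a", size_mb=10, file_count=5_000),
    Tenant("b", size_mb=500, file_count=5_200),
    Tenant("c", size_mb=50, file_count=5_000),
    Tenant("d", size_mb=5, file_count=4_800),
]
moved, files_left = plan_drain(instance, safe_file_limit=10_000)
```

In this example the three smallest tenants are moved and the largest stays put, leaving 5,200 files on the instance.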

The draining process used their existing database migration tools, but at a completely different scale. They prioritized moving the smallest tenants with the least usage first because:

Step 2: Convert

Once an RDS instance was drained to safe file count levels, they could proceed with the standard RDS to Aurora conversion process:

With fewer tenants, this conversion process became much more reliable and predictable.

Step 3: Refill

Instead of leaving the new Aurora clusters half-empty, they immediately started using them as destinations for draining other RDS instances.

AWS had confirmed that once an RDS instance was successfully converted to Aurora, they could re-populate it with tenant databases up to sensible limits. This meant they didn't need to permanently reduce their tenant density. They just needed to reduce it temporarily during the conversion process.

The Daisy Chain Approach

This three-step process created what Atlassian called a "daisy chain" of conversions:

This approach had a huge advantage: minimal infrastructure overhead. 

Instead of needing to provision massive amounts of temporary infrastructure to hold displaced tenants, they only needed enough extra capacity to accommodate the tenants from one or two instances being converted at any given time.
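A small simulation shows why the extra capacity stays flat. It is a sketch under stated assumptions: every instance starts with the same tenant count, each must be drained to the same hypothetical safe level, and a converted cluster can be refilled to the original density. The numbers and the function name are illustrative, not Atlassian's.

```python
def bootstrap_capacity_needed(
    rds_instances: int, tenants_per_instance: int, safe_tenants: int
) -> int:
    """Count tenants that must land on extra capacity across the whole chain."""
    drained_per_instance = tenants_per_instance - safe_tenants
    free_slots = 0         # room on Aurora clusters that were already converted
    bootstrap_tenants = 0  # tenants placed on extra, temporary capacity
    for _ in range(rds_instances):
        # Drain: fill converted clusters first, spill the rest to extra capacity.
        absorbed = min(drained_per_instance, free_slots)
        free_slots -= absorbed
        bootstrap_tenants += drained_per_instance - absorbed
        # Convert: the new cluster holds safe_tenants and can be refilled.
        free_slots += tenants_per_instance - safe_tenants
    return bootstrap_tenants


small_fleet = bootstrap_capacity_needed(10, tenants_per_instance=4_000, safe_tenants=1_000)
large_fleet = bootstrap_capacity_needed(1_000, tenants_per_instance=4_000, safe_tenants=1_000)
```

Both fleets need extra room for the same 3,000 tenants: only the first instance in the chain spills onto temporary capacity, and every later drain is absorbed by the cluster converted just before it.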

Through careful optimization and tooling improvements, they achieved remarkable scaling:

Conclusion

In the end, Atlassian had accomplished something remarkable in the world of database migrations. The final statistics paint a picture of an operation that succeeded on every measure:

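Here’s a quick look at the migration scale: 2,403 RDS instances converted, 4.3 million databases migrated (including draining operations), and 27.4 billion files handled.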

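The final Aurora infrastructure looked dramatically different from where they started, but in all the right ways: roughly twice as many instances, each of them smaller, with reader capacity that scales down during off-peak hours to reduce cost.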

Atlassian's Aurora migration represents more than just a successful database platform upgrade. Their collaboration with AWS throughout this process led to a better understanding of Aurora's operational limits and likely influenced future improvements to the platform.

The project delivered on all its original objectives: significant cost savings through better resource utilization, improved reliability through Aurora's superior SLA, and enhanced performance through auto-scaling capabilities. But perhaps most importantly, they accomplished all of this while maintaining their commitment to minimal user impact.

References:

