Temporal Blog, September 30
Temporal Improves Application Reliability

 

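Temporal improves application reliability, and Temporal Cloud extends that to operations. Self-hosting Temporal means maintaining multiple components, including a database and the server services, and deploying them across availability zones for high availability. Temporal Cloud provides expert management, tunes the database and service configuration, targets a 99.99% service level objective, and deploys across three availability zones for fault tolerance. Users manage only their Workers and Workflows while Temporal Cloud handles the backend, which suits mission-critical applications.

🔧 Self-hosting Temporal means maintaining several independent components: a highly available database (such as Cassandra or Postgres) deployed across availability zones, plus four server services that must be resourced to avoid request bottlenecks. This places high demands on the operations team.

🌐 Temporal Cloud improves reliability through expert management: the team has deep experience and takes care of tuning the database and service configuration, so users only deploy their application logic and carry a much lighter operational load.

📈 High availability guarantees: a 99.99% service level objective (SLO) and a 99.9% contractual SLA, deployment across three availability zones by default for fault tolerance, RTO/RPO of zero for availability zone failures and at most eight hours for region failures.

🚀 Upcoming feature for disaster recovery: Multi-Region Namespaces, currently in pre-release, will support failover from a primary region to a standby region with history events replicated asynchronously, and will extend the contractual SLA to 99.99%. It is aimed at scenarios with strict business continuity requirements.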

Temporal makes your applications more reliable. But from an operational perspective, any complex software is hard to run reliably at scale. In this post, we’ll give a brief overview of the challenges of self-hosting Temporal at scale, and the ways in which Temporal Cloud provides high availability. For more details, you can watch our webinar recording on this topic.

Challenges of maintaining high availability when self-hosting Temporal

The core challenge of achieving high availability with Temporal is that the Service is composed of multiple independently scalable components. You must tune each and maintain their availability:

    - A database, typically Cassandra or Postgres, which is usually sharded and deployed in a highly available way, preferably across multiple availability zones.
    - Four independent services that make up the Temporal Server. These services must be resourced properly so there are no bottlenecks in the critical path of serving requests.
    - As with any distributed system, failures are inevitable, and understanding how to operate under different failure conditions is necessary to keep the service stable and available at all times. Some failures are relatively easy to deal with (a machine going down), while some are subtle and require careful attention (a network partition).
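
To see why every component needs attention, here is a rough back-of-the-envelope sketch. It assumes that all components sit in the critical path of a request and fail independently, which real systems only approximate. The 99.9% figures are illustrative placeholders, not measurements of any Temporal deployment, and `serial_availability` is a hypothetical helper name.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a request path that needs every component to be up.

    Assumption: component failures are independent, so the availabilities
    simply multiply.
    """
    return prod(component_availabilities)


# Hypothetical figures: one database plus four server services, each at 99.9%.
components = [0.999] * 5
combined = serial_availability(components)

print(f"per-component availability: {components[0]:.2%}")
print(f"combined availability:      {combined:.2%}")
```

Under these assumptions, five components at 99.9% each yield a combined availability of roughly 99.5%, which is why a self-hosted deployment has to hold each piece well above the target it wants for the service as a whole.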

Managing each of these services at smaller scales is straightforward. But to run them at scale in production, you must have a lot of expertise. That’s not to say it’s impossible. Many developers successfully self-host Temporal. But they may have difficulty meeting high availability SLAs, and often spend significant time and resources operating Temporal. For mission-critical applications and high-scale use cases, we always recommend evaluating Temporal Cloud.

High availability with Temporal Cloud

With Temporal Cloud, our team delivers Temporal-as-a-service. We properly tune the supporting database and services for your load, and ensure they’re highly available. Because our team has deep Temporal expertise and manages thousands of namespaces, we can provide better service reliability, higher availability, and lower latency, and we keep a larger buffer of resources in reserve for unexpected events.

As a Temporal Cloud customer, you're only responsible for deploying and managing your Workers and Workflows in your applications, and connecting your application to your managed Temporal Service.

Here are the details of the high availability guarantees Temporal Cloud provides:

    - Fault tolerance: Temporal Cloud namespaces are deployed across three availability zones by default, so any AZ failure would be a non-event for your namespace.
    - 99.99% service level objective (SLO): as a service, Temporal Cloud regularly provides four 9’s of availability, measured as the availability of the service endpoint.
    - 99.9% service level agreement (contractual SLA): the Temporal Cloud contractual SLA is based on the average number of gRPC service errors over five-minute intervals for the month. Contractually, if we do not meet this objective, we will issue Cloud credits based on the outage.
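
To put these percentages in perspective, the sketch below converts an availability target into a monthly downtime budget and estimates a month's availability from per-interval error rates. The formula in `monthly_availability` (one minus the mean error rate across five-minute intervals) is an assumption made for illustration, not the contractual definition; both function names and the sample data are hypothetical.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def downtime_budget_minutes(availability_target,
                            period_minutes=MINUTES_PER_30_DAY_MONTH):
    """Minutes of unavailability a target allows over the period."""
    return (1 - availability_target) * period_minutes


def monthly_availability(interval_error_rates):
    """Assumed formula: 1 minus the mean error rate over all intervals.

    Each entry is the fraction of failed requests (0.0 to 1.0) in one
    five-minute interval.
    """
    if not interval_error_rates:
        raise ValueError("need at least one interval")
    return 1 - sum(interval_error_rates) / len(interval_error_rates)


print(f"99.99% budget: {downtime_budget_minutes(0.9999):.2f} min/month")
print(f"99.9%  budget: {downtime_budget_minutes(0.999):.2f} min/month")

# Hypothetical month: 8,640 five-minute intervals, ten of which saw half of
# all requests fail; the rest were error-free.
rates = [0.5] * 10 + [0.0] * 8630
availability = monthly_availability(rates)
print(f"estimated availability: {availability:.4%}")
```

In a 30-day month, 99.99% leaves about 4.3 minutes of downtime and 99.9% leaves about 43 minutes. The hypothetical month above would stay within a 99.9% target while missing 99.99%.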

For disaster recovery, Temporal Cloud provides the following:

    - RTO/RPO for availability zone failures: the RTO/RPO are zero for availability zone failures, because Temporal Cloud is replicated across multiple availability zones.
    - RTO/RPO for region failures: the RTO/RPO are eight hours at maximum, which is two backup periods for Temporal Cloud.
    - COMING SOON: Multi-Region Namespaces: currently in pre-release, this capability will provide failover capabilities to mitigate service outages due to regional failures. It will also extend our contractual SLA to 99.99%. With Multi-Region Namespaces, your cloud service will be defined by a primary cloud region and a standby cloud region. History events automatically ship into the standby region asynchronously. In the event of a primary region failure, you can manually switch traffic to the standby region without disrupting ongoing Workflows. We recommend this capability if disruption of your Workflows would cause loss of revenue, poor end-user experience, or issues with regulatory compliance.

This is just a brief overview of high availability in Temporal Cloud. For more details, we recommend watching the webinar.

This post is part of a series about Temporal Cloud.
