Temporal Blog, September 30
Temporal Cloud: Architecture Practices and Lessons Learned

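By rethinking traditional architecture and operational strategies, Temporal Cloud aims to deliver an exceptional user experience as a managed cloud service. The article examines Temporal Cloud's decisions about tenancy models, resource management, and durable execution, along with the lessons learned. It highlights the advantages of multi-tenancy, the architecture of the data plane and control plane, and how the durable execution model simplifies managing complex workflows. Through a cell-based architecture and Temporal's durable execution, Temporal Cloud achieves high availability, scalability, and reliability, and gives developers a more efficient development experience.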

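Temporal Cloud uses a multi-tenant model to optimize resource usage: customers pay for actual usage, and shared headroom absorbs traffic spikes, which improves efficiency and cost-effectiveness.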

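The data plane and control plane have clearly defined roles. The data plane processes transactions, executes workflows, and handles customer data, so it must maintain high availability, low latency, and resilience to failures. The control plane acts as the brain of the system, managing resources, provisioning namespaces, and handling configuration; its performance requirements are lower than the data plane's, but reliability still matters.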

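Temporal Cloud uses a cell-based architecture to give the data plane strong isolation and scalability. Each cell runs as a self-contained unit, including compute pods, databases, load balancers, and other components, so a failure in one cell does not affect the others, reducing the risk of cascading failures.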

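Temporal's durable execution model simplifies managing complex workflows by separating business logic from state management and failure handling. Developers can focus on writing simple, happy-path code while the system automatically handles retries, error handling, and state persistence, improving development efficiency and system reliability.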

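In the control plane, scenarios such as namespace creation and version rollouts show how well Temporal's workflow model applies. Workflows keep long-running deployments resilient and stable, and a staged rollout (synthetic traffic, then low-priority traffic, then high-priority traffic) lowers deployment risk.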

In today’s world of managed cloud services, delivering exceptional user experiences often requires rethinking traditional architecture and operational strategies. At Temporal, we faced this challenge head-on, navigating complex decisions about tenancy models, resource management, and durable execution to build a reliable, scalable cloud service. This post explores our approach and the lessons we learned while creating Temporal Cloud.

The Case for Managed Cloud Services

Managed services have become the default for delivering hosted solutions to customers. Whether it’s a database, queueing system, or another server-side technology, hosting a service not only provides a better user experience but also opens doors for monetization, especially for open-source projects. The challenge is how to do it effectively while maintaining reliability and scalability.

One of the first decisions we made was about tenancy models. Should we pursue single-tenancy — provisioning dedicated clusters for each customer — or opt for multi-tenancy, which allows multiple customers to share the same resources? While single-tenancy offers simplicity and isolation, its inefficiencies quickly become apparent. Customers end up paying for unused capacity, and providers shoulder higher operational costs. Multi-tenancy, though harder to implement, emerged as the clear winner. It optimizes resource usage, allows customers to pay for actual usage, and creates shared headroom for handling traffic spikes.
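The capacity argument can be made concrete with a little arithmetic. The sketch below uses invented load figures for three hypothetical tenants and compares dedicated clusters, each sized for its own peak, with one shared pool sized for the peak of the combined load. The numbers and function names are illustrative assumptions, not Temporal Cloud data.

```python
# Hypothetical load samples (requests/sec) for three tenants; all numbers are invented.
tenant_load = {
    "tenant_a": [10, 80, 20, 10],
    "tenant_b": [70, 10, 15, 20],
    "tenant_c": [15, 20, 90, 10],
}


def single_tenant_capacity(load):
    """Each tenant gets a dedicated cluster sized for its own peak."""
    return sum(max(series) for series in load.values())


def multi_tenant_capacity(load):
    """One shared pool sized for the peak of the combined load."""
    combined = [sum(values) for values in zip(*load.values())]
    return max(combined)


dedicated = single_tenant_capacity(tenant_load)  # 80 + 70 + 90
shared = multi_tenant_capacity(tenant_load)      # max of [95, 110, 125, 40]
print(f"dedicated: {dedicated} req/s, shared: {shared} req/s")
```

The saving depends on tenants peaking at different times. If every tenant peaked in the same interval, the two figures would be equal.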

Data Plane vs. Control Plane: Defining Responsibilities

Splitting a managed service into a data plane and a control plane is an industry best practice. We followed it, clearly defining and implementing distinct roles for each within our cloud architecture.

    Data Plane: This is where the actual work happens (processing transactions, executing workflows, and handling customer data). It must maintain high availability, low latency, and resilience to failures. For Temporal Cloud, we adopted a cell-based architecture to isolate resources and minimize the blast radius of potential failures.
    Control Plane: This acts as the brain of the system, managing resources, provisioning namespaces, and handling configurations. While its performance is less critical than the data plane, reliability here still matters for customer experience. For instance, provisioning a namespace may not be urgent, but delays or errors in this process can frustrate users.

Implementing the Data Plane: A Cell-Based Architecture

For the data plane, we applied a cell-based architecture to achieve strong isolation and scalability. Each cell operates as a self-contained unit with its own AWS account, VPC, EKS cluster, and supporting infrastructure, so failures or updates in one cell do not impact others, which reduces the risk of cascading outages. Although described here in AWS terms, we have applied the same principles to Google Cloud Platform (GCP), using its equivalent primitives to keep behavior consistent and reliable across cloud providers.

Each cell in Temporal Cloud includes:

    Compute Pods: Running Temporal services and infrastructure tools for observability, ingress management, and certificate handling.
    Databases: Both primary databases and Elasticsearch for enhanced visibility.
    Additional Components: Load balancers, private connectivity endpoints, and other supporting infrastructure that ensures smooth operation and integration across environments.

Currently, Temporal Cloud operates across 14 AWS regions, and we’ve also added support for GCP. This architecture allows us to meet the diverse needs of our customers while maintaining reliability at scale.
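To illustrate how cells bound the blast radius, here is a minimal in-memory model. The `Cell` class, the "most free slots" placement rule, and all names are assumptions made for illustration; the post does not describe how Temporal Cloud actually places namespaces.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """Hypothetical model of a self-contained cell."""
    name: str
    region: str
    capacity: int
    healthy: bool = True
    namespaces: set = field(default_factory=set)

    @property
    def free_slots(self):
        return self.capacity - len(self.namespaces)


def place_namespace(cells, namespace, region):
    """Pick the healthy cell in the region with the most free slots (assumed rule)."""
    candidates = [
        c for c in cells
        if c.region == region and c.healthy and c.free_slots > 0
    ]
    if not candidates:
        raise RuntimeError(f"no cell available in {region}")
    chosen = max(candidates, key=lambda c: c.free_slots)
    chosen.namespaces.add(namespace)
    return chosen


def affected_namespaces(cells, failed_cell_name):
    """Only namespaces hosted in the failed cell are affected."""
    return set().union(*(c.namespaces for c in cells if c.name == failed_cell_name))


cells = [
    Cell("cell-1", "us-east-1", capacity=2),
    Cell("cell-2", "us-east-1", capacity=2),
    Cell("cell-3", "eu-west-1", capacity=2),
]
first = place_namespace(cells, "orders", "us-east-1")
second = place_namespace(cells, "billing", "us-east-1")
third = place_namespace(cells, "reports", "eu-west-1")
impact = affected_namespaces(cells, "cell-1")
```

In this toy run a failure of `cell-1` touches only the namespace placed there, while namespaces in the other cells are unaffected.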

Durable Execution: The Foundation of the Control Plane

Building the control plane presented its own set of challenges, particularly around reliability and maintainability. Control plane tasks, such as provisioning namespaces or rolling out updates, involve complex long-running processes with many interdependent steps. Writing this logic as traditional, ad-hoc code often leads to brittle systems that are hard to debug and evolve.

This is where Temporal’s durable execution model shines. Designed based on experience with earlier systems like AWS Simple Workflow Service and Azure Durable Functions, Temporal’s approach separates business logic from state management and failure handling. Developers can write workflows as straightforward, happy-path code without worrying about retries, error handling, or state persistence. The system automatically manages these concerns, allowing workflows to seamlessly recover from failures.
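The separation of concerns can be sketched in plain Python. This is a toy illustration and not the Temporal SDK: `run_with_retries` stands in for the platform, `FlakyService` is a hypothetical dependency that fails transiently, and `workflow` contains only happy-path logic.

```python
import time


def run_with_retries(activity, *args, max_attempts=5, initial_delay=0.01,
                     backoff=2.0, sleep=time.sleep):
    """Stand-in for the platform: retry with exponential backoff."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return activity(*args)
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted; surface the failure
            sleep(delay)
            delay *= backoff


class FlakyService:
    """Hypothetical dependency that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def create_record(self, name):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("transient failure")
        return f"record:{name}"


def workflow(service, name):
    # Happy-path only: no try/except and no sleep loops in the business logic.
    # The sleep function is replaced with a no-op so the example runs instantly.
    return run_with_retries(service.create_record, name, sleep=lambda _: None)


service = FlakyService(failures_before_success=2)
result = workflow(service, "ns1")
```

Here the retry policy lives in one place and can be tuned without touching the business logic.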

Namespace Provisioning: A Real-World Example

Consider the process of creating a new namespace in Temporal Cloud. When a user clicks “Create Namespace” on the web interface, the control plane orchestrates a series of tasks:

    Selecting a suitable cell within the chosen region.
    Creating database records and roles.
    Generating and provisioning mTLS certificates.
    Configuring ingress routes and verifying connectivity.

Each step involves external API calls, DNS propagation, and other potential points of failure.

Without durable execution, managing retries, backoffs, and state persistence would result in a tangle of brittle code. With Temporal, these tasks are encapsulated in workflows, which transparently handle retries and maintain state across failures. Developers can focus on the high-level logic, confident that the system will handle the edge cases.
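A toy version of the idea, assuming an in-memory `Journal` in place of durable storage: each completed step records its result, so a re-run after a crash skips finished steps and resumes at the one that failed. The step functions and names are hypothetical, and this is a sketch of the general technique rather than Temporal's implementation.

```python
class Journal:
    """In-memory stand-in for durable storage of completed step results."""

    def __init__(self):
        self.results = {}


def run_step(journal, step_name, func, *args):
    """Run a step once; on re-execution, return the recorded result instead."""
    if step_name in journal.results:
        return journal.results[step_name]
    result = func(*args)
    journal.results[step_name] = result
    return result


def provision_namespace(journal, namespace, region, steps):
    """Happy-path orchestration of the four provisioning steps."""
    cell = run_step(journal, "select_cell", steps["select_cell"], region)
    run_step(journal, "create_db_records", steps["create_db_records"], namespace, cell)
    cert = run_step(journal, "issue_certificate", steps["issue_certificate"], namespace)
    run_step(journal, "configure_ingress", steps["configure_ingress"], namespace, cell)
    return {"namespace": namespace, "cell": cell, "certificate": cert}


calls = []
cert_attempts = {"count": 0}


def select_cell(region):
    calls.append("select_cell")
    return f"{region}-cell-1"


def create_db_records(namespace, cell):
    calls.append("create_db_records")
    return True


def issue_certificate(namespace):
    calls.append("issue_certificate")
    cert_attempts["count"] += 1
    if cert_attempts["count"] == 1:
        raise RuntimeError("simulated crash while issuing certificate")
    return f"cert-for-{namespace}"


def configure_ingress(namespace, cell):
    calls.append("configure_ingress")
    return True


steps = {
    "select_cell": select_cell,
    "create_db_records": create_db_records,
    "issue_certificate": issue_certificate,
    "configure_ingress": configure_ingress,
}

journal = Journal()
try:
    provision_namespace(journal, "orders", "us-east-1", steps)
except RuntimeError:
    pass  # the process "crashed"; the journal survives

# Second execution: the first two steps are replayed from the journal.
result = provision_namespace(journal, "orders", "us-east-1", steps)
```

After the second run, cell selection and database setup have each executed once, and only the certificate step was attempted twice.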

Rolling Upgrades: Ensuring Safe Deployments

Another common control plane scenario is rolling out updates to the Temporal Cloud fleet. Our deployment strategy involves organizing cells into deployment rings, progressing from pre-production environments to customer-facing cells with increasing priority of traffic.

The rollout process is carefully staged:

    Ring 0: Synthetic traffic only, no customer impact. Changes are monitored here for at least a week.
    Ring 1: Low-priority traffic namespaces, allowing for additional testing with minimal risk.
    Higher Rings: Gradually expanding to critical, high-priority traffic customers.

Within each ring, updates are applied in batches, with pauses between batches to observe for potential issues like memory leaks or race conditions. Temporal workflows handle this process, ensuring that even long-running deployments (which can span weeks) are resilient to failures or restarts.
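The staged rollout can be sketched as a loop over rings and batches. Everything below is an illustrative assumption: the ring contents, the batch size, the `is_healthy` check, and the halt-on-first-unhealthy-batch rule. The `pause` callback is a placeholder for the observation window, which in the real system lasts days or weeks.

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def rollout(rings, version, upgrade, is_healthy, batch_size=2, pause=lambda: None):
    """Upgrade rings in order, batch by batch; halt at the first unhealthy batch."""
    upgraded = []
    for ring_name, cells in rings:
        for batch in batches(cells, batch_size):
            for cell in batch:
                upgrade(cell, version)
                upgraded.append(cell)
            pause()  # observation window between batches
            if not all(is_healthy(cell) for cell in batch):
                return {"status": "halted", "ring": ring_name, "upgraded": upgraded}
    return {"status": "complete", "ring": None, "upgraded": upgraded}


# Hypothetical fleet: ring 0 carries synthetic traffic, higher rings carry customers.
rings = [
    ("ring-0", ["synthetic-1"]),
    ("ring-1", ["low-1", "low-2", "low-3"]),
    ("ring-2", ["high-1", "high-2"]),
]
versions = {}


def upgrade(cell, version):
    versions[cell] = version


def is_healthy(cell):
    return cell != "low-3"  # simulate a regression that shows up on one cell


outcome = rollout(rings, "v2", upgrade, is_healthy)
```

In this run the rollout halts in `ring-1`, so the high-priority cells in `ring-2` never receive the faulty version.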

Entity Workflows: A Powerful Pattern

Temporal’s durable execution also enables powerful patterns like entity workflows. These are workflows tied to specific resources, such as cells or namespaces, providing a natural way to model state and operations. For example, each cell in Temporal Cloud has an entity workflow that manages its lifecycle, from provisioning to upgrades. This approach ensures consistency and simplifies concurrency control.
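One way to picture the pattern is a single long-lived object per resource that applies incoming operations one at a time. The `CellEntity` class, its states, and its transition table are hypothetical and far simpler than a real entity workflow, but they show why serializing operations per entity simplifies concurrency control.

```python
from collections import deque


class CellEntity:
    """Toy stand-in for an entity workflow: one instance per cell."""

    # operation -> (state required before, state after); an assumed lifecycle
    TRANSITIONS = {
        "provision": ("new", "active"),
        "upgrade": ("active", "active"),
        "retire": ("active", "retired"),
    }

    def __init__(self, cell_id):
        self.cell_id = cell_id
        self.state = "new"
        self.version = None
        self.inbox = deque()
        self.rejected = []

    def signal(self, operation, **params):
        """Enqueue an operation; callers never mutate state directly."""
        self.inbox.append((operation, params))

    def drain(self):
        """Apply queued operations strictly in order, one at a time."""
        while self.inbox:
            operation, params = self.inbox.popleft()
            required, next_state = self.TRANSITIONS[operation]
            if self.state != required:
                self.rejected.append(operation)  # invalid for the current state
                continue
            if operation in ("provision", "upgrade"):
                self.version = params["version"]
            self.state = next_state


entity = CellEntity("cell-1")
entity.signal("upgrade", version="1.1")    # rejected: the cell is not provisioned yet
entity.signal("provision", version="1.0")
entity.signal("upgrade", version="1.2")
entity.signal("retire")
entity.signal("upgrade", version="1.3")    # rejected: the cell is retired
entity.drain()
```

Because all changes flow through one queue, two upgrades can never interleave, and operations that are invalid for the current state are rejected rather than silently applied.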

Developer Happiness and Productivity

One of the biggest benefits of Temporal’s approach is the impact on developer experience. By eliminating the need to write boilerplate code for retries, backoffs, and state management, developers can focus on delivering business value. Temporal’s built-in tools for observing and debugging workflows further enhance productivity, making it easier to understand and troubleshoot complex systems.

Happy developers are productive developers, and Temporal’s approach fosters this by reducing the cognitive load and frustration associated with traditional workflow coding.

Why Durable Execution Matters

Durable execution is more than a technical innovation; it’s a paradigm shift for building cloud-native systems. By decoupling business logic from state management and failure handling, Temporal empowers developers to build reliable, scalable systems with less effort. Whether you’re managing control planes, provisioning resources, orchestrating complex workflows, performing money transfers, training AI models, or processing social media posts, this approach delivers clear benefits.

At Temporal, we’ve seen firsthand how durable execution transforms the development process, enabling us to deliver a robust managed service that scales with our customers’ needs.

Ready to Transform Your Control Plane?

Temporal isn’t just a tool for building cloud systems; it’s a better way to think about workflows and application architecture. If you’re building or planning a managed cloud service, consider how durable execution can simplify your journey and unlock new possibilities. For more insights into our approach, check out my full talk at QCon.


Ready to explore what Temporal Cloud can do? Now’s the perfect time — get $1,000 in Temporal Cloud credits and start building today!
