Temporal Cloud架构实践与经验

In today’s world of managed cloud services, delivering exceptional user experiences often requires rethinking traditional architecture and operational strategies. At Temporal, we faced this challenge head-on, navigating complex decisions about tenancy models, resource management, and durable execution to build a reliable, scalable cloud service. This post explores our approach and the lessons we learned while creating Temporal Cloud.

The Case for Managed Cloud Services#

Managed services have become the default for delivering hosted solutions to customers. Whether it’s a database, queueing system, or another server-side technology, hosting a service not only provides a better user experience but also opens doors for monetization, especially for open-source projects. The challenge is how to do it effectively while maintaining reliability and scalability.

One of the first decisions we made was about tenancy models. Should we pursue single-tenancy — provisioning dedicated clusters for each customer — or opt for multi-tenancy, which allows multiple customers to share the same resources? While single-tenancy offers simplicity and isolation, its inefficiencies quickly become apparent. Customers end up paying for unused capacity, and providers shoulder higher operational costs. Multi-tenancy, though harder to implement, emerged as the clear winner. It optimizes resource usage, allows customers to pay for actual usage, and creates shared headroom for handling traffic spikes.

Data Plane vs. Control Plane: Defining Responsibilities#

Architecting a managed service in terms of the data plane and control plane is an industry best practice that we followed, clearly defining and implementing their distinct roles within our cloud architecture.

Data Plane

Control Plane

Implementing the Data Plane: A Cell-Based Architecture#

For the data plane, we applied a cell-based architecture to achieve strong isolation and scalability. Each cell operates as a self-contained unit with its own AWS account, VPC, EKS cluster, and supporting infrastructure. While this approach is framed within the context of AWS, we have applied the same principles to Google Cloud Platform (GCP), leveraging its equivalent primitives to ensure consistency and reliability across cloud providers. This approach ensures that failures or updates in one cell do not impact others, reducing the risk of cascading outages.

Each cell in Temporal Cloud includes:

Compute Pods

Databases

Additional Components

Durable Execution: The Foundation of the Control Plane#

Building the control plane presented its own set of challenges, particularly around reliability and maintainability. Control plane tasks, such as provisioning namespaces or rolling out updates, involve complex long-running processes with many interdependent steps. Writing this logic as traditional, ad-hoc code often leads to brittle systems that are hard to debug and evolve.

This is where Temporal’s durable execution model shines. Designed based on experience with earlier systems like AWS Simple Workflow Service and Azure Durable Functions, Temporal’s approach separates business logic from state management and failure handling. Developers can write workflows as straightforward, happy-path code without worrying about retries, error handling, or state persistence. The system automatically manages these concerns, allowing workflows to seamlessly recover from failures.

Namespace Provisioning: A Real-World Example#

Consider the process of creating a new namespace in Temporal Cloud. When a user clicks “Create Namespace” on the web interface, the control plane orchestrates a series of tasks:

Selecting a suitable cell within the chosen region. Creating database records and roles. Generating and provisioning mTLS certificates. Configuring ingress routes and verifying connectivity. Each step involves external API calls, DNS propagation, and other potential points of failure.

Without durable execution, managing retries, backoffs, and state persistence would result in a tangle of brittle code. With Temporal, these tasks are encapsulated in workflows, which transparently handle retries and maintain state across failures. Developers can focus on the high-level logic, confident that the system will handle the edge cases.

Rolling Upgrades: Ensuring Safe Deployments#

Another common control plane scenario is rolling out updates to the Temporal Cloud fleet. Our deployment strategy involves organizing cells into deployment rings, progressing from pre-production environments to customer-facing cells with increasing priority of traffic.

The rollout process is carefully staged:

Ring 0

Ring 1

Higher Rings

Entity Workflows: A Powerful Pattern#

Temporal’s durable execution also enables powerful patterns like entity workflows. These are workflows tied to specific resources, such as cells or namespaces, providing a natural way to model state and operations. For example, each cell in Temporal Cloud has an entity workflow that manages its lifecycle, from provisioning to upgrades. This approach ensures consistency and simplifies concurrency control.

Developer Happiness and Productivity#

One of the biggest benefits of Temporal’s approach is the impact on developer experience. By eliminating the need to write boilerplate code for retries, backoffs, and state management, developers can focus on delivering business value. Temporal’s built-in tools for observing and debugging workflows further enhance productivity, making it easier to understand and troubleshoot complex systems.

Happy developers are productive developers, and Temporal’s approach fosters this by reducing the cognitive load and frustration associated with traditional workflow coding.

Why Durable Execution Matters#

Durable execution is more than a technical innovation; it’s a paradigm shift for building cloud-native systems. By decoupling business logic from state management and failure handling, Temporal empowers developers to build reliable, scalable systems with less effort. Whether you’re managing control planes, provisioning resources, orchestrating complex workflows, performing money transfers, training AI models, or processing social media posts, this approach delivers clear benefits.

At Temporal, we’ve seen firsthand how durable execution transforms the development process, enabling us to deliver a robust managed service that scales with our customers’ needs.

Ready to Transform Your Control Plane?#

Temporal isn’t just a tool for building cloud systems; it’s a better way to think about workflows and application architecture. If you’re building or planning a managed cloud service, consider how durable execution can simplify your journey and unlock new possibilities. For more insights into our approach, check out my full talk at QCon.

Ready to explore what Temporal Cloud can do? Now’s the perfect time — get $1,000 in Temporal Cloud credits and start building today!