Temporal Blog · September 30
Temporal Cloud Reduces Application Latency

Developers often worry that Temporal will add latency to their applications. Temporal offers a range of features and Workflow design patterns to meet the needs of most applications, but the minimum latency still depends on the network latency between Workers and the Temporal Service and on how the Temporal Service is tuned. End-to-end application latency on Temporal Cloud is significantly lower than on self-hosted Temporal, thanks to architectural advantages such as a custom persistence layer. Benchmarks show that Temporal Cloud outperforms the self-hosted version on five key metrics, completing Workflows nearly twice as fast as a self-hosted instance. Latency-sensitive, large-scale, or business-critical applications should consider Temporal Cloud, which offers service-level guarantees, support, and better price-performance.

💡 Temporal Cloud significantly reduces application latency through architectural advantages such as a custom persistence layer, more efficient sharding, a write-ahead log, and tiered storage; these gains are large enough to offset the added network latency compared with self-hosting.

📊 Benchmarks show that Temporal Cloud outperforms self-hosted Temporal on five key metrics (Workflow end-to-end latency, start-Workflow latency, signal-Workflow latency, Workflow Task completion response latency, and Activity Task completion response latency), with substantially lower p50 and p90 latencies across the board.

⏱️ Temporal Cloud completed Workflows nearly twice as fast as the self-hosted instance, finishing in 50.1% of the self-hosted time at both p50 and p90, addressing developers' concern that Temporal adds latency.

💰 Temporal Cloud offers better price-performance, avoiding the cost and complexity of scaling, provisioning, and highly available deployment of the database and services required by a self-hosted installation, making it better suited to large-scale production applications.

🔧 All production workloads, especially latency-sensitive, large-scale, or business-critical applications, should consider Temporal Cloud for its service-level guarantees, support, and better performance per cost.

Developers often worry Temporal will add latency to their applications. Temporal provides a variety of features and Workflow design patterns to help you meet the latency requirements of most apps, but ultimately, your minimum latency will depend on the network latency between your Workers and the Temporal Service, and how well the Temporal Service is tuned.

You might assume your application latency will be higher if you use Temporal Cloud rather than self-host Temporal. After all, with Temporal Cloud, the Temporal Service will no longer be located on the same infrastructure as your Workers, increasing network latency.

In reality, the opposite is true: end-to-end application latency is significantly lower when using Temporal Cloud compared to self-hosted Temporal. Temporal Cloud offers important architectural advancements to reduce latency, including a custom persistence layer. The latency improvements in Temporal Cloud’s architecture are so effective that they eclipse the cost of the higher network latency incurred when talking to Temporal Cloud.

We’ve frequently noticed latency improvements when customers migrate from self-hosted Temporal to Temporal Cloud. To quantify these observations, we benchmarked application-side metrics against a self-hosted Temporal Service and Temporal Cloud, with application Workers hosted in the same region.

The results demonstrate lower latency in Temporal Cloud compared to the self-hosted instance, supporting what we often tell customers: Temporal Cloud is the best choice for internet-scale and low-latency workloads.

Benchmark Overview and Setup

We measured five application-side SDK metrics, which we chose because they contribute to application latency. Temporal’s SDKs emit these metrics by default.
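For reference, the sketch below shows one common way to surface these SDK metrics from a Go Worker process, using the SDK's tally contrib package to expose a Prometheus scrape endpoint. It is a minimal, illustrative setup; the listen address and scope settings are assumptions, not the configuration used in this benchmark.

```go
package main

import (
	"log"
	"time"

	prom "github.com/prometheus/client_golang/prometheus"
	"github.com/uber-go/tally/v4"
	tallyprom "github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

func main() {
	// Expose SDK metrics (Workflow end-to-end latency, request latencies, etc.)
	// on a local Prometheus scrape endpoint. Address is an illustrative choice.
	reporter, err := tallyprom.Configuration{
		ListenAddress: "0.0.0.0:9090",
		TimerType:     "histogram",
	}.NewReporter(tallyprom.ConfigurationOptions{
		Registry: prom.NewRegistry(),
		OnError:  func(err error) { log.Println("prometheus reporter error:", err) },
	})
	if err != nil {
		log.Fatalln("unable to create Prometheus reporter:", err)
	}
	scope, closer := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter: reporter,
		Separator:      tallyprom.DefaultSeparator,
	}, time.Second)
	defer closer.Close()

	// Attach the metrics handler to the Temporal client; Workers created from
	// this client emit the SDK metrics automatically.
	c, err := client.Dial(client.Options{
		HostPort:       client.DefaultHostPort,
		MetricsHandler: sdktally.NewMetricsHandler(scope),
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()
	// ... register Workers and run Workflows as usual.
}
```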

We set up the benchmarking infrastructure using a dedicated latency benchmarking framework, which is available here. The framework builds a Kubernetes cluster and deploys Omes, our tool for benchmarking and load testing. Omes includes Temporal Workers and a scenario runner to simulate an application environment.

To evaluate a wide range of SDK metrics, we used the "throughput_stress" Omes scenario, which exercises a comprehensive set of Temporal primitives. The Activities used by the "throughput_stress" Workflow are lightweight and require little CPU. Because this benchmark focuses on latency rather than throughput, the scenario was configured to skip sleeps and run only one Workflow at a time. With this configuration, the Workflow's end-to-end latency serves as an effective metric for comparing Temporal Service instances.

During the benchmark, all pods, nodes, and databases were below 80% CPU utilization.

To test a self-hosted Temporal Service, the tool installs a Temporal Service backed by a MySQL database.

All of the measurements in this post were recorded in clusters running in the AWS us-west-2 region.

Metric 1: Workflow End-to-End Latency

What it measures: Workflow_EndtoEnd_Latency measures total execution time, from schedule to completion, for a single Workflow Execution.

Why we benchmarked it: This metric can be used to quickly compare the performance of instances running the same workload, as it shows the latency of a full Execution.
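As a rough client-side analogue of this metric, you can time a single Workflow from the start request until its result is available. The sketch below is illustrative only: ProcessOrderWorkflow (sketched under Metric 4 below) and the task queue name are placeholders, and it assumes the client `c` and imports from the earlier snippet.

```go
// measureEndToEnd starts one Workflow and waits for its result, reporting the
// wall-clock time from the start request to completion.
func measureEndToEnd(ctx context.Context, c client.Client) error {
	start := time.Now()
	run, err := c.ExecuteWorkflow(ctx, client.StartWorkflowOptions{
		ID:        "e2e-latency-sample",
		TaskQueue: "benchmark-task-queue", // illustrative task queue name
	}, ProcessOrderWorkflow, "demo-order")
	if err != nil {
		return err
	}
	// Get blocks until the Workflow Execution completes.
	if err := run.Get(ctx, nil); err != nil {
		return err
	}
	log.Printf("end-to-end latency: %v", time.Since(start))
	return nil
}
```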

Results

| | p50 Latency | p90 Latency |
| --- | --- | --- |
| Self-Hosted Temporal | 750 ms | 950 ms |
| Temporal Cloud | 376 ms | 476 ms |

Metric 2: StartWorkflowExecution Latency

What it measures: the round-trip time for requests to start a Workflow. To start a Workflow, your application contacts the Temporal Service. The Temporal Service durably persists records representing the Workflow and acknowledges the request with a response to your application. Because the request has been durably persisted, the Temporal Service does not have to wait until the Workflow has started executing to respond.

Why we benchmarked it: This metric is important because applications tend to start Workflows frequently, often as a result of inline handling of a web request. Keeping this latency low is particularly important to avoid holding up web requests.
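For illustration, here is a hypothetical HTTP handler that starts a Workflow inline with a web request; ProcessOrderWorkflow, the task queue, and the ID scheme are placeholders. Because ExecuteWorkflow returns once the start request has been durably persisted, the handler does not wait for the Workflow to run.

```go
// handleOrder returns an HTTP handler that starts a Workflow for each request.
// The response is sent as soon as the Temporal Service has persisted the start.
func handleOrder(c client.Client) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		orderID := r.URL.Query().Get("id")
		run, err := c.ExecuteWorkflow(r.Context(), client.StartWorkflowOptions{
			ID:        "order-" + orderID, // illustrative ID scheme
			TaskQueue: "orders",           // illustrative task queue
		}, ProcessOrderWorkflow, orderID)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		// Respond immediately with the new Workflow's identifiers.
		fmt.Fprintf(w, "started workflow %s (run %s)\n", run.GetID(), run.GetRunID())
	}
}
```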

Results

| | p50 Latency | p90 Latency |
| --- | --- | --- |
| Self-Hosted Temporal | 23.8 ms | 42.6 ms |
| Temporal Cloud | 17.7 ms | 23.8 ms |

Metric 3: SignalWorkflowExecution Latency

What it measures: the round-trip time for requests to signal a Workflow. As with start-Workflow requests, these must be persisted by the Temporal Service to ensure they are not lost.

Why we benchmarked it: Applications use Signals to inform Workflows of external events. These Signals are often delivered as the result of some action in a UI or receiving an event on a message queue. Low latency here helps keep the application responsive and improves message queue efficiency.
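A minimal sketch of delivering such an event as a Signal, for example from a message-queue consumer; the Workflow ID scheme, Signal name, and payload are placeholders, and the client `c` comes from the earlier snippet.

```go
// onPaymentEvent forwards an external event to a running Workflow as a Signal.
func onPaymentEvent(ctx context.Context, c client.Client, orderID string, amount float64) error {
	// An empty run ID targets the latest Run of the given Workflow ID.
	return c.SignalWorkflow(ctx, "order-"+orderID, "", "payment-received", amount)
}
```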

Results

| | p50 Latency | p90 Latency |
| --- | --- | --- |
| Self-Hosted Temporal | 17.5 ms | 23.5 ms |
| Temporal Cloud | 7.64 ms | 9.76 ms |

Metric 4: RespondWorkflowTaskCompleted Latency

What it measures: the response time from Workers to the Temporal Service when a Workflow Task is completed. As Workflows make progress, Workers must communicate with the Temporal Service, detailing which actions to take next. This may be starting a new Child Workflow, scheduling an Activity, or simply setting a Timer to wake the Workflow up again later.

Why we benchmarked it: Workflow throughput is impacted by how quickly Workers can communicate with the Temporal Service. If the Temporal Service responds more quickly, Worker performance improves, and with it, application performance.
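To make those commands concrete, here is an illustrative Workflow definition (names are placeholders, and the usual go.temporal.io/sdk/workflow import is assumed). Each action it takes is reported back to the Temporal Service as part of a RespondWorkflowTaskCompleted request.

```go
// ProcessOrderWorkflow is an illustrative Workflow. Each step below produces a
// command (schedule Activity, start Child Workflow, start Timer) that the
// Worker sends in a RespondWorkflowTaskCompleted request.
func ProcessOrderWorkflow(ctx workflow.Context, orderID string) error {
	ao := workflow.ActivityOptions{StartToCloseTimeout: 10 * time.Second}
	ctx = workflow.WithActivityOptions(ctx, ao)

	// Command: schedule an Activity.
	if err := workflow.ExecuteActivity(ctx, ChargeCard, orderID).Get(ctx, nil); err != nil {
		return err
	}

	// Command: start a Child Workflow (ShipOrderWorkflow is a placeholder).
	if err := workflow.ExecuteChildWorkflow(ctx, ShipOrderWorkflow, orderID).Get(ctx, nil); err != nil {
		return err
	}

	// Command: set a Timer to wake the Workflow up again later.
	return workflow.Sleep(ctx, time.Hour)
}
```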

Results

| | p50 Latency | p90 Latency |
| --- | --- | --- |
| Self-Hosted Temporal | 23.9 ms | 51.5 ms |
| Temporal Cloud | 17.8 ms | 24.7 ms |

Metric 5: RespondActivityTaskCompleted Latency

What it measures: the response time from Workers to the Temporal Service when an Activity Task is completed.

Why we benchmarked it: Workflow throughput is impacted by how quickly Workers can communicate with the Temporal Service. Activities are used to perform a single, well-defined action such as calling another service or processing data. If the Temporal Service responds more quickly, Worker performance improves, and with it, application performance.
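A sketch of such an Activity, here a single HTTP call to a hypothetical payments service; the URL and function name are placeholders.

```go
// ChargeCard is an illustrative Activity: one well-defined action, an HTTP
// call to another service. When it returns, the Worker reports the outcome to
// the Temporal Service via RespondActivityTaskCompleted.
func ChargeCard(ctx context.Context, orderID string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://payments.example.com/charge?order="+url.QueryEscape(orderID), nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("charge failed: %s", resp.Status)
	}
	return nil
}
```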

Results

| | p50 Latency | p90 Latency |
| --- | --- | --- |
| Self-Hosted Temporal | 23.9 ms | 61.7 ms |
| Temporal Cloud | 17.3 ms | 30.8 ms |

Analysis of the latency differences

This benchmark supports what we’ve observed anecdotally after customer migrations: Temporal Cloud provides lower application-side latency than self-hosted Temporal. Lower latencies across the board resulted in reduced end-to-end latency for our test Workflow: Temporal Cloud completed it nearly twice as fast as the self-hosted instance, finishing in 50.1% of the time at both p50 and p90.

These latency improvements can be attributed to Temporal Cloud’s custom persistence layer, which includes more efficient sharding, a write-ahead log (WAL), and tiered data storage. We designed this architecture specifically for high throughput and large scale. As is apparent in this benchmark, the benefits of the custom persistence layer far outweigh any incurred network latency when using Temporal Cloud.

What this benchmark means for production applications

Thousands of users currently run applications in production with both self-hosted Temporal and Temporal Cloud. Our recommendation is to consider Temporal Cloud for all production workloads because of its service-level guarantees and support, especially for latency-sensitive, large-scale, or business-critical applications.

A final consideration is that Temporal Cloud offers better price-performance than self-hosted Temporal. The self-hosted Temporal Service in this benchmark was well tuned and never overloaded (it stayed below 80% CPU at all times). In practice, that is not always easy to achieve. Scaling the Temporal Service for a high-throughput use case can be labor-intensive: you must scale your database (Postgres, MySQL, or Cassandra) and manage four additional independent services. The database and all services must be resourced properly, provisioned for peak load to avoid bottlenecks, and deployed in a highly available manner. These infrastructure and operational costs are steep compared to the consumption-based pricing of Temporal Cloud.

Learn more and run the benchmark yourself

We’ve provided details on how to reproduce this benchmark here. As with any benchmark, results may vary depending on your workload and scale.

Here are some other helpful resources to learn more:
