Temporal Blog · September 30, 19:15
Temporal Worker Tuning Guide

 

This article takes a deep dive into configuring and performance-tuning Temporal Workers. It first introduces the core concepts of Workers, Task Queues, Tasks, and Commands, and explains how Workers advance Workflows by polling for Tasks, executing code, and reporting completion. It then examines when you need more Workers (for example, rising Task Schedule-To-Start latency or no available Task slots) and when you may have too many (for example, a falling poll success rate, or very low Schedule-To-Start latency alongside poll timeouts). Finally, it walks through the key metrics that help you find the right number of Workers, balancing performance against cost.

✨ **Core concepts**: The article first defines the key Temporal terms: Workers (the entities that execute Workflows and Activities), Task Queues (a logical separation of work), Tasks (Workflow Tasks and Activity Tasks), and Commands (instructions Workers send to the Server). Understanding these concepts is the foundation for the performance analysis that follows.

📈 **When you need more Workers**: A rising `workflow_task_schedule_to_start_latency` or `activity_task_schedule_to_start_latency` indicates Tasks are backing up in the Task Queue, and `worker_task_slots_available` dropping to zero means the Workers cannot take on more Tasks. In either case, consider adding Workers or adjusting concurrency settings to increase processing capacity.

📉 **When you have too many Workers**: If the poll success rate (`poll_success` relative to `poll_timeouts`) stays below 0.95 (as low as 0.8 can be acceptable under heavy load), Schedule-To-Start latency is very low, and host load is modest, the Worker pool is likely oversized; Worker poll requests may be rejected by the Server or simply sit idle. Consider reducing the number of Workers or pollers.

📊 **Key metric insights**: The article highlights several key metrics, such as `workflow_task_schedule_to_start_latency` (how long a Task waits between being enqueued and being started), `worker_task_slots_available` (the slots a Worker has free to execute Tasks), and the poll success rate. Monitoring them makes it possible to diagnose Worker bottlenecks, whether from insufficient processing capacity or from over-provisioning.

⚖️ **Finding the balance**: The goal is to find the sweet spot in Worker count: enough to meet the load without wasting resources or degrading performance. Too few or too many Workers both cause problems, and the article offers strategies for identifying and correcting either case so that your Temporal system runs stably and efficiently.

An Introduction to Worker Tuning

When you were first learning Temporal, Workers probably seemed pretty simple:

// err handling omitted for brevity
func main() {
    c, err := client.NewClient(client.Options{})
    defer c.Close()

    w := worker.New(c, "task-queue", worker.Options{})

    w.RegisterWorkflow(app.Workflow)
    w.RegisterActivity(app.GreetingActivity)

    err = w.Run(worker.InterruptCh())
}

That is, register your Workflow and Activities, and then run until canceled. Is it really that simple? Regardless of what’s happening behind the scenes, you inevitably grow to a point where one Worker is not enough to handle the number of new Workflows.

Whether you’re self-hosting or using Temporal Cloud, you need a pool of Workers to actually run your Workflows and Activities. This guide discusses a few Worker deployment patterns and some of the configurations that have the biggest impact on Worker performance.

Note that, as in all production settings, performance tuning is a much more involved process than what can be covered in a single general-purpose article. This guide hopefully gives you a starting point for what to look for, but you should expect Worker tuning to be a nuanced and ongoing process.

Setup

This guide assumes that you’re familiar with Temporal concepts and that you have written a proof of concept Workflow or have run one or more of the sample Workflows.

At the end of this guide, you will have the foundational knowledge necessary for deploying and managing your own Worker Pool.

This guide focuses on the Workers themselves and the strategies for performance tuning Worker-side code. For server-side tuning (in the case of a self-hosted deployment of the Temporal Cluster), see this article.

Background Terminology

First, some concepts you should be familiar with before we dive deeper.

Workers

The term "Worker" is used to refer to the entity that actually runs your code and makes forward progress on your Workflows and Activities. A single process can have one or many Workers.

Most of your first Worker code is minimal: create a client, register Activities and a Workflow, and then run until shut down. But a lot happens behind the scenes, including maintaining long-poll connections to the server, caching and batch-sending Events and Commands, receiving new Tasks from the Server, emitting metrics, streaming Event Histories when necessary, and more.

Task Queues

Task Queues are the entities that Workers poll for more work, and a Task Queue is more than just a string you need to keep identical in multiple places. Task Queues serve as a logical separation of work, such as different types of Workflows or Activities with different functional requirements.
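To make that separation concrete, here is a minimal sketch (not from the article; the Task Queue names, Workflow, and Activities are hypothetical, and the code runs inside main with imports and error handling omitted) of one process hosting two Workers, each polling its own Task Queue:

// Two Workers in one process, each polling a different Task Queue.
c, _ := client.NewClient(client.Options{})
defer c.Close()

// Worker 1: order-processing Workflows and their quick Activities.
orderWorker := worker.New(c, "orders", worker.Options{})
orderWorker.RegisterWorkflow(app.OrderWorkflow) // hypothetical
orderWorker.RegisterActivity(app.ChargeCard)    // hypothetical

// Worker 2: long-running, resource-heavy Activities on their own queue,
// so they can be scaled and tuned independently of the order Workers.
mediaWorker := worker.New(c, "media-encoding", worker.Options{})
mediaWorker.RegisterActivity(app.EncodeVideo) // hypothetical

// Start the first Worker without blocking, then block on the second.
_ = orderWorker.Start()
defer orderWorker.Stop()
_ = mediaWorker.Run(worker.InterruptCh())

A Workflow or Activity scheduled on the "orders" queue will only ever be picked up by Workers polling "orders", which is what lets each class of work have its own fleet and configuration.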

Tasks

Task Queues contain Tasks created by the Temporal Server. Tasks represent and contain the context needed for a Worker to make progress. There are two types of Tasks: Workflow Tasks, which include the Workflow type and the Event History, and Activity Tasks, which specify the Activity to run.

Commands

Commands are sent from Workers to the Temporal Server to indicate what needs to happen next, such as starting a Timer or scheduling an Activity to run.
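Ordinary Workflow code is what produces these Commands. A short sketch (not from the article; "SomeActivity" is a hypothetical Activity registered elsewhere) showing which statements map to which Commands:

import (
    "time"

    "go.temporal.io/sdk/workflow"
)

func SampleWorkflow(ctx workflow.Context, name string) (string, error) {
    // workflow.Sleep causes the Worker to send a StartTimer Command; the
    // Workflow resumes when the matching TimerFired Event appears in history.
    if err := workflow.Sleep(ctx, time.Minute); err != nil {
        return "", err
    }

    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 10 * time.Second,
    })

    // workflow.ExecuteActivity causes a ScheduleActivityTask Command; the
    // Activity's result arrives later as an ActivityTaskCompleted Event.
    var greeting string
    if err := workflow.ExecuteActivity(ctx, "SomeActivity", name).Get(ctx, &greeting); err != nil {
        return "", err
    }
    return greeting, nil
}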

Events

Events are the records of your Workflow’s progress and include, among other things, results of completed Commands. Events, rather than Commands, are what you see in the Workflow Execution Event History. You may commonly reference Events to see the results of an Activity: when a Worker successfully completes an Activity, it sends the results to the Temporal Server, which adds an ActivityTaskCompleted Event with those results to the Workflow’s history.

The other common use for Events is in the Worker: every Workflow Task includes the Events that the Server already knows have occurred.
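If you want to inspect those Events programmatically rather than in the Web UI, the Client can page through a Workflow Execution's Event History. A minimal sketch (not from the article; the Workflow ID is hypothetical, and an empty Run ID targets the latest run):

import (
    "context"
    "log"

    enumspb "go.temporal.io/api/enums/v1"
    "go.temporal.io/sdk/client"
)

func printActivityResults(ctx context.Context, c client.Client) {
    iter := c.GetWorkflowHistory(ctx, "order-workflow-id", "", false,
        enumspb.HISTORY_EVENT_FILTER_TYPE_ALL_EVENT)
    for iter.HasNext() {
        event, err := iter.Next()
        if err != nil {
            log.Fatalln("failed to read history", err)
        }
        // ActivityTaskCompleted Events carry the Activity's result payloads.
        if event.GetEventType() == enumspb.EVENT_TYPE_ACTIVITY_TASK_COMPLETED {
            attrs := event.GetActivityTaskCompletedEventAttributes()
            log.Println("activity completed:", attrs.GetResult())
        }
    }
}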

How Workers Work

Workers are responsible for making progress on your Workflow and they do so by picking up Tasks from a Task Queue. These Tasks are either Workflow or Activity tasks, depending on the state of the Workflow Execution.

When Workers start, they open asynchronous long-polling connections to the server to get new tasks on the specified Task Queue. In the simplest form, when a Worker receives a Task, it does the following:

    1. Validate that the Task is not malformed (e.g., missing the Event History when it should be non-zero length).
    2. Check that this Worker can run the requested work (i.e., that the Workflow or Activity types are known and valid).
    3. If it is a Workflow Task, validate that the unseen portion of the Event History in the Task matches the known Workflow Definition (i.e., check for determinism), then run the Workflow code as far as possible (i.e., until all coroutines have yielded and are waiting on the Server).
    4. If it is an Activity Task, run the Activity with the given arguments.
    5. Send "Task Completed" back to the Server, along with any results (e.g., an Activity's return values).

At this point, the Temporal Server records the relevant updates to this Workflow instance’s Event History and generates new Tasks as necessary.

Aside: Why two different types of Tasks?

If both Workflow and Activity Tasks follow the same pattern—check that the type is registered, then run the matching code—why are they different?

Recall why the concept of Activities exists in the first place: to ensure that unreliable work is isolated and doesn’t cause unnecessary failures of the overall system. For many categories of unreliability, such as API or network timeouts, simply retrying until successful is all that’s necessary to effectively remove that unreliability.

This conceptual separation between unreliable work and deterministic work necessitates the two different Task types. This allows the Activity Tasks to be substantially simpler than Workflow Tasks. It doesn’t matter whether it’s the first time this Activity has run, the Nth time, or happening after a long Timer. The Worker’s job is to run the Activity specified in the Task using the provided inputs and return the results.

A Workflow, meanwhile, must be deterministic. If it’s not, Temporal cannot guarantee that the Worker is running the same Workflow implementation that the Server thinks it should be. The Tasks that facilitate this, Workflow Tasks, must therefore include the information needed for the Worker to validate determinism. It can’t simply execute (or, like with Activities, re-execute) the next step in the Workflow. The Worker must check that how it thinks that step is reached is how the Server thinks it was reached. This is done by comparing the Event history given in the Workflow Task to what the Worker thinks the history would be if it re-ran the Workflow. We call this process “Replay.”
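To make the determinism requirement concrete, here is a sketch (not from the article) contrasting Workflow code that could fail Replay with a replay-safe version using the SDK's deterministic equivalents:

import (
    "math/rand"
    "time"

    "go.temporal.io/sdk/workflow"
)

// NonDeterministicWorkflow can emit a different Command sequence each time it
// runs: time.Now and rand.Intn change between the original execution and
// Replay, so the replayed Commands may not match the recorded history.
func NonDeterministicWorkflow(ctx workflow.Context) error {
    if time.Now().Hour() < 12 && rand.Intn(2) == 0 {
        return workflow.Sleep(ctx, time.Hour)
    }
    return nil
}

// DeterministicWorkflow uses workflow.Now and workflow.SideEffect, so
// replaying it against the recorded Event History always reproduces the same
// Commands.
func DeterministicWorkflow(ctx workflow.Context) error {
    var coin int
    if err := workflow.SideEffect(ctx, func(ctx workflow.Context) interface{} {
        return rand.Intn(2) // recorded in history the first time, replayed afterwards
    }).Get(&coin); err != nil {
        return err
    }
    if workflow.Now(ctx).Hour() < 12 && coin == 0 {
        return workflow.Sleep(ctx, time.Hour)
    }
    return nil
}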

How many Workers do you need?

When to know you need more

As your application grows, it's inevitable that a single Worker won't be enough to complete all your Workflow or Activity Executions, as we'll see below (in fact, best practice is to always have more than one). To keep up with the increased demand, you'll need to scale up your Workers. But, as with all capacity planning, what's the minimum number of Workers you need?

Reduced ability of Workers to start new Tasks

Above: An example graph of the workflow_task_schedule_to_start_latency metric. At around timestamp 11:55 the number of incoming Workflows increased dramatically, causing a sizable (and growing) backlog in Workflow Tasks. At 12:05, the number of Workers was increased and the schedule_to_start latency started falling (this could also happen with an increase in the number of pollers).

Your Workflows only make progress when Tasks are dequeued by Workers and the associated work is run. The Schedule-To-Start metric represents how long Tasks are staying, unprocessed, in the Task Queues. If it’s increasing or high, and your hosts are at capacity, add Workers. If your Workflow Schedule-To-Start is increasing or high and your hosts still have capacity, increase the Workers’ max concurrent Workflow Task configuration.
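When the hosts still have headroom, the relevant knobs are Worker options. A sketch in the Go SDK (the values are illustrative, not recommendations; c is the client created earlier):

w := worker.New(c, "task-queue", worker.Options{
    // More pollers means more concurrent long-poll requests fetching Tasks
    // from the Task Queue (the Go SDK default is 2 of each).
    MaxConcurrentWorkflowTaskPollers: 8,
    MaxConcurrentActivityTaskPollers: 8,
    // More execution slots means more Tasks running at once on this Worker.
    MaxConcurrentWorkflowTaskExecutionSize: 500,
})

Raising these only helps while the host still has CPU and memory to spare; once it doesn't, add Workers (or hosts) instead.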

Reduced ability of Workers to run Tasks

Above: A contrived example where Activities have started coming in too fast for the number of Worker slots to handle. Configured at 5, but drops to 0 with heavy load. (Note: this graph and the previous occurred on different days and exemplify different scenarios. They are not occurring concurrently; they should be considered as entirely distinct from each other.)

Worker Task Slots represent how much capacity Workers have to actually do the work. When a Worker starts processing a Task, a Slot is filled. So when there are zero slots available, your Worker pool is unable to start new Tasks.

The worker_task_slots_available metric should always be >0. If it’s not, add more Workers or increase the max concurrent execution size, according to what’s reported in the metric’s worker_type label:

w := worker.New(client, "task_queue", worker.Options{
    MaxConcurrentActivityExecutionSize:     1000, // default in the Go SDK is 1k
    MaxConcurrentWorkflowTaskExecutionSize: 1000, // default in the Go SDK is 1k
})


There’s a very remote coincidence that can occur where zero Task slots means you have a perfect balance between the ability for the Workers to complete work and the amount of incoming work. But in this coincidental balance, Tasks are still waiting in the Task Queue to be picked up by Workers. Even if that wait is a minuscule amount of time, no extra Task processing slots are available and so your application is not resilient to even the slightest increase in load.


This is not an exhaustive list of metrics or configuration values to use for determining when more Workers are needed, only the most common. For more, see the Worker Performance page in the Dev Guide and the full list of SDK and Server metrics.
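Note that the SDK metrics discussed here are only emitted if the Client is given a metrics handler. One possible setup, as a sketch assuming the go.temporal.io/sdk/contrib/tally package and a Prometheus scrape endpoint (the listen address and timer type are illustrative):

import (
    "log"
    "time"

    "github.com/uber-go/tally/v4"
    "github.com/uber-go/tally/v4/prometheus"
    "go.temporal.io/sdk/client"
    sdktally "go.temporal.io/sdk/contrib/tally"
)

// newPrometheusScope exposes SDK metrics (schedule_to_start latencies,
// worker_task_slots_available, poll counters, etc.) for Prometheus to scrape.
func newPrometheusScope() tally.Scope {
    reporter, err := prometheus.Configuration{
        ListenAddress: "0.0.0.0:9090", // illustrative
        TimerType:     "histogram",
    }.NewReporter(prometheus.ConfigurationOptions{
        OnError: func(err error) { log.Println("prometheus reporter error:", err) },
    })
    if err != nil {
        log.Fatalln("unable to create prometheus reporter", err)
    }
    scope, _ := tally.NewRootScope(tally.ScopeOptions{
        CachedReporter: reporter,
        Separator:      prometheus.DefaultSeparator,
    }, time.Second)
    return scope
}

func newMetricsClient() client.Client {
    // Every Worker built from this Client will emit SDK metrics.
    c, err := client.NewClient(client.Options{
        MetricsHandler: sdktally.NewMetricsHandler(newPrometheusScope()),
    })
    if err != nil {
        log.Fatalln("unable to create Temporal client", err)
    }
    return c
}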

When to know you have too many Workers

Perhaps counterintuitively, it is possible to have too many Workers. Beyond simply paying for compute resources you don't need, it could mean your Workers' requests to the Server for new Tasks get denied.

There are three things you should look at for determining if you have too many Workers:

    Polling success rate

This rate, defined as

(poll_success + poll_success_sync) / (poll_success + poll_success_sync + poll_timeouts)

represents how often a Worker's poll for a Task gets a response from the Server (either a Task or a "no Tasks in the queue" reply) rather than timing out.

An example PromQL for this rate is as follows:

  (
      sum by (namespace, taskqueue) (rate(poll_success[5m]))
    +
      sum by (namespace, taskqueue) (rate(poll_success_sync[5m]))
  )
/
  (
      sum by (namespace, taskqueue) (rate(poll_success[5m]))
    +
      sum by (namespace, taskqueue) (rate(poll_success_sync[5m]))
    +
      sum by (namespace, taskqueue) (rate(poll_timeouts[5m]))
  )

In a well-functioning system, this rate should be at or very close to 1.0 and consistently greater than 0.95. In practice, though, it's expected and normal for some percentage of Worker polls to time out, and in a high-load situation a rate as low as 0.8 can still be acceptable.

    Schedule-To-Start latency

As mentioned earlier in this guide, the Workflow and Activity Schedule-To-Start latencies should be consistently very low, as they represent the Workers' ability to keep up with the number of incoming Activity and Workflow Executions. The exception is if you are purposefully rate limiting Activities at a Worker or Task Queue granularity (see the sketch after this list); in that situation it's expected that Schedule-To-Start latencies will be high, and it is not necessarily a sign that you need to optimize your Workers.

When low, you likely have plenty of Workers. However, when some of their requests for Tasks are timing out, then not only is a portion of your Worker pool doing nothing, it’s inefficiently doing nothing. (In many cases this is okay though. For example, over-provisioning to account for load spikes. As in all capacity provisioning, plan according to your specific load characteristics.)

    Host load

Many situations can lead to Task requests timing out, including an overloaded Worker host. And, having an overloaded Worker host is often a sign that you need more hosts.

But in the contrary case, where requests for Tasks are timing out while Worker host load is low (or has plenty of headroom), the Workers' requests are likely being dropped. (These "Task requests" are the API calls the Workers make to the Temporal Server to get more work; specifically, PollWorkflowTaskQueueRequest and PollActivityTaskQueueRequest.)

Above: A set of three graphs exemplifying a situation where there are too many Worker pollers. From left to right: (1) poll success rate is near 1.0 but starts to fall. (2) Workflow and Activity Schedule-To-Start latencies are, eventually, quite low (1-3ms). (3) CPU usage during the same time period is consistently much less than 20%. In this example, Workflow Schedule-To-Start latency started relatively high, and so the operator tried to add many more pollers to their Workers. This resulted in reduced Schedule-To-Start latency and no effect on host load, but also caused >80% of Worker polls to fail. They should reduce the number of pollers and/or Workers.

When all three of these factors are occurring, you likely have too many Workers. Consider reducing the number of pollers, the number of Workers, or the number of Worker hosts.
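As for the rate-limiting exception mentioned under Schedule-To-Start latency above, these are the Go SDK options involved; a sketch with illustrative values (c is an existing client):

w := worker.New(c, "task-queue", worker.Options{
    // Limit how many Activities this single Worker starts per second.
    WorkerActivitiesPerSecond: 10,
    // Limit the Activity start rate across all Workers polling this Task Queue.
    TaskQueueActivitiesPerSecond: 100,
})

With either limit in place, Activity Tasks intentionally wait in the Task Queue, so a high activity_task_schedule_to_start_latency is expected rather than a problem.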

Summary: Identifying Poor Worker Performance

Note that the metrics listed below are only some of the indicators for measuring Worker performance. Depending on your situation and environment(s), you may find other SDK metrics or Cluster metrics useful.

| Metric(s) | Metric Type | What to look for | Meaning |
| --- | --- | --- | --- |
| workflow_task_schedule_to_start_latency and activity_task_schedule_to_start_latency | SDK | High and/or increasing. This latency should be consistently near zero. | Schedule-to-Start latency is the time between when a Task is enqueued and when it is picked up by a Worker. If it is long, your Workers can't keep up: either increase the number of Workers (if the host load is already high) or increase the number of pollers per Worker. |
| worker_task_slots_available (filter on the worker_type tag; possible values are "WorkflowWorker" and "ActivityWorker") | SDK | Both worker types should always be >0. | task_slots represents the number of executors (threads or otherwise) a Worker has available to actually run your code. To have room for incoming Workflows and Activities, this number should always be greater than 0. |
| poll_success vs poll_timeouts | Server | Underutilized Worker hosts, plus low Schedule-to-Start latencies, plus a low (<90%) poll success rate. | You may have too many Workers. You might also start to see RPC call failures and rate limiting from the Temporal Server. Either way, consider reducing the number of Workers. |

Conclusion

This guide outlines one of the most fundamental pieces of Temporal: Workers. When you don’t have enough Workers, your Workflows can stall, slow down, or possibly never finish. When you have too many Workers, at best you’re paying too much for compute resources; at worst, your Workers are rate limited, denying them the ability to poll for new Tasks.

Either way, striking the balance of having the right number of Workers can be tricky. Hopefully this guide provides you with pointers for how to find that balance in your application.

Title image courtesy of Pop & Zebra on Unsplash.
