Temporal Blog — September 30, 19:16
生产环境中部署Temporal Worker的运维指南

 

This article covers the key operational considerations for deploying Temporal Workers in production, focusing on four core areas: operational metrics monitoring and alerting, incident response and tooling, upgrading and versioning, and testing failure paths. It stresses that in a distributed system, visibility into Worker health is essential, and describes how Temporal provides that insight through rich metrics. It also offers strategies for identifying and handling Workflow Task failures, explains in depth how to use the `getVersion` API to safely migrate long-running Workflows, and shows how simulating Worker outages supports testing and capacity planning. Finally, it recommends thoroughly testing timeout and retry policies to prepare for inevitable failures.

📊 **Operational Metrics Monitoring and Alerting**: In production, knowing how your Temporal Workers are running is essential. Temporal emits rich client and server metrics, including counters, latencies, and server and namespace health. Via the Tally library, these metrics integrate with tools such as m3, Prometheus, Statsd, and Datadog, and dashboards are provided for Grafana. Key items to watch include Activity Worker and Workflow Worker start, completion, and failure rates, along with Task latency; these metrics are the foundation for judging application health and setting alerts.

🚨 **Incident Response and Troubleshooting**: When the Workflow Task failure rate spikes, you are in incident response mode. The first step is to determine whether the failure originates in your application or in Temporal Server itself, which you can judge from the "service request" and "service error" metrics. If the failure is on the application side, logs are the key investigative tool; make sure they carry structured data such as workflow type, activity type, workflow ID, and run ID. Execution history and Replay can then be used to dig into why a specific workflow failed.

🔄 **Upgrading and Versioning**: Migrating long-running Workflows that are still in flight is a challenge. Temporal's `getVersion` API (or the Patching API in TypeScript) lets you change code while preserving Workflow determinism. The API internally tracks the code version at the time a Workflow started and uses user-defined branches to navigate code paths. Before migrating, sample Workflow histories at different stages and run replay tests. For short-lived Workflows, you can also use separate Task Queues to isolate deployments of old and new code.

⚠️ **Testing Failure Paths and Capacity Planning**: Developers often neglect scenarios off the "happy path," such as backlogs (multiple Workflow Tasks resolving at once after Workers or the Server were briefly offline) and timeouts. Test for backlogs by simulating Worker downtime, and do capacity planning that accounts for the resources consumed while working through a backlog or a traffic spike. Temporal's various Activity timeouts and retry policies should also be verified with unit and integration tests to ensure they behave as intended.

Update: see also our new docs on SDK metrics and Temporal Worker Tuning for detailed instructions.

Whether you are self-hosting Temporal Server or using Temporal Cloud, a key part of Temporal's operational model is running your own Workers. We recently hosted a chat with Temporal CTO Samar Abbas on the 4 main areas to note when productionizing your Temporal Workers:

    1. Operational Metrics - Monitoring and Alerting
    2. Incident Response & Tooling
    3. Upgrading and Versioning
    4. Testing Failure Paths

Watch the full presentation here.

Operational Health Metrics - Monitoring and Alerting (02:30)#

The great thing about Temporal is that you can develop against it locally without constraints (and it is getting even faster with temporalite). But when it comes to running your Temporal application in production, you are now dealing with a distributed system.

In other words: How will you know if your Temporal application is doing its job or is even running as expected?

Temporal emits a lot of metrics, both client and server, which gives insights into how your application is performing: counters, latency, and metrics for server health and for each namespace. Temporal uses the Tally library from Uber to emit properly buffered metrics with nice reporting integration with m3, Prometheus, Statsd, and Datadog via custom reporters, and provides dashboards for Grafana users to display them. We also recently co-presented about Temporal and M3 with Chronosphere at KubeCon NA 2021.

    Activity Worker Health
      - How many activities are getting started?
      - How many activities are completing successfully?
      - Which activities are failing?
      - Set alerts for Activity failure rate
    Workflow Worker Health
      - What rate are your Workflow Tasks happening at?
      - Are your Workflow Tasks completing successfully?
      - What is your Workflow Task latency?
      - Set alerts for Workflow Task failure rate
    Scaling

Our full reference on SDK/Worker metrics is here.
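As a concrete (and deliberately simplified) illustration of alerting on these numbers, the sketch below computes a failure rate from two counters and applies a threshold. The function names and the 5% threshold are our own invention, not part of the SDK; in practice you would express this as a query over the SDK metrics in your monitoring system, e.g. as a Prometheus alert rule.

```go
package main

import "fmt"

// failureRate returns the fraction of failed executions among all
// finished executions in a monitoring window.
func failureRate(failed, total float64) float64 {
	if total == 0 {
		return 0
	}
	return failed / total
}

// shouldAlert applies a simple threshold, e.g. page when more than 5%
// of Workflow Tasks in the window failed.
func shouldAlert(failed, total, threshold float64) bool {
	return failureRate(failed, total) > threshold
}

func main() {
	fmt.Println(shouldAlert(2, 100, 0.05))  // 2% failures: below threshold
	fmt.Println(shouldAlert(12, 100, 0.05)) // 12% failures: alert
}
```

The same threshold logic applies to Activity failure rates; only the counters you feed in change.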

Once you have ingested all these metrics and set up a dashboard to give you visibility, you can start to get an overall picture of what's going on in the system.

Incident Response & Tooling#

When Workflow Task failure rates spike, you are now in incident response mode. There can be many possible causes, so the first thing an operator needs to do is figure out whether the failure is coming from your application or from Temporal Server.

    - Metrics. The first metrics to look at are the "service request" and "service error" metrics emitted by the Server. If you are seeing Workflow Tasks or Activities failing, that is a strong indication that you should focus on the application side rather than Temporal Server.
    - Logs. The best tool for investigating application-side failures is your logs - make sure logs are captured and stored by your Workers, and tagged with important structured data like workflow type, activity type, workflow ID, and run ID. Capture as much information as possible in tags so you can slice and dice logs on arbitrary criteria.
    - Execution History. You should be able to tell whether failures are coming from one particular workflow type, and to isolate a specific workflow/run ID to investigate. You can then use Temporal Web or the Temporal CLI (tctl) to look at the state and execution history of that workflow, which will reveal more datapoints for debugging. The history view also shows the entire call stack of the failure.
    - Replay. You can also replay the Workflow to investigate it: download the execution history via the Web UI or CLI, and step through the code with a debugger on your local machine.
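To make the "tagged with structured data" point concrete, here is a minimal sketch using Go's standard log/slog package. The field names (`workflow_type`, `workflow_id`, etc.) and the example values are our own convention for illustration, not an SDK standard; the Temporal SDKs also ship their own context-aware loggers.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"strings"
)

// taggedLog writes one log line carrying the structured fields that make
// Worker logs sliceable: workflow type, activity type, workflow ID, and
// run ID. The values here are placeholders.
func taggedLog(msg string) string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewTextHandler(&buf, nil)).With(
		"workflow_type", "OrderWorkflow",
		"activity_type", "ChargeCard",
		"workflow_id", "order-1234",
		"run_id", "example-run-id",
	)
	logger.Error(msg)
	return buf.String()
}

// hasTag reports whether the rendered log line contains a key=value tag.
func hasTag(line, tag string) bool { return strings.Contains(line, tag) }

func main() {
	line := taggedLog("payment gateway returned 503")
	fmt.Println(hasTag(line, "workflow_id=order-1234"))
	fmt.Println(hasTag(line, "activity_type=ChargeCard"))
}
```

With fields like these attached once at logger construction, every subsequent line from that Activity can be filtered by workflow or run ID in your log aggregator.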

Upgrading and Versioning (17:59)#

How do you migrate long running workflows that are still in flight? Most developers who hand roll their own job scheduling and orchestration frameworks don’t have anything better than “deploy and pray”, but Temporal offers a first class solution.

Temporal requires that your Workflows must be fully deterministic, and changing Workflow code means potentially breaking determinism. This is where the getVersion API is handy because it allows you to make changes in Workflow logic in a deterministic fashion.

Temporal offers a getVersion API in the Go, Java, and PHP SDKs, and a Patching API in TypeScript, which essentially amounts to feature flagging by start-time cohort:

v := workflow.GetVersion(ctx, "Step1", 1, 2)
if v == 1 {
	err = workflow.ExecuteActivity(ctx, ActivityC, data).Get(ctx, &result1)
} else {
	err = workflow.ExecuteActivity(ctx, ActivityD, data).Get(ctx, &result1)
}

We have also published a 30 min tutorial on versioning to guide you through this API.

    - How it works: Temporal internally tracks the version of the Workflow code at the time that particular Workflow Execution was started, and uses that to navigate code paths based on these user-defined branches. New Workflow Executions take the latest paths by default.
    - Sample and Replay: While this is an elegant and powerful API (because it can be used and composed in arbitrary logic), it still needs to be handled carefully. This is why replaying workflow executions is so important. Before migration, take samples of Workflow histories at different stages, and use the replay API.
    - Use Task Queues for Short-Lived Workflows: Think through your upgrade strategy - you may not even need versioning if your workflows are short-lived enough! If you just want to cut over to new code, you can run old code on the same Task Queue with a subset of Workers, and run new code on a different set of Workers with a new Task Queue.
    - Worst Case: When in doubt, take a breath - because of Temporal's event-sourced nature, if a migration is screwed up, the worst case is that your Workflows don't make progress. Latency will spike, but data will not be lost - just make sure you are fully familiar with metrics and logs so you can investigate issues.
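The "How it works" bullet can be sketched in a few lines. This is a self-contained toy model of the idea behind GetVersion, not the SDK implementation: the first time an execution hits a change ID, the chosen version is recorded in its history; on replay, the recorded value is returned, so executions started before the change keep taking their old branch deterministically.

```go
package main

import "fmt"

// history stands in for the per-execution event history where version
// markers are recorded (a toy model, not the real data structure).
type history map[string]int

func getVersion(h history, changeID string, maxSupported int) int {
	if v, ok := h[changeID]; ok {
		return v // replaying: honor the version recorded at first execution
	}
	h[changeID] = maxSupported // new execution: record the latest version
	return maxSupported
}

func main() {
	oldRun := history{"Step1": 1} // started before the code change
	newRun := history{}           // started after the code change

	fmt.Println(getVersion(oldRun, "Step1", 2)) // old branch preserved
	fmt.Println(getVersion(newRun, "Step1", 2)) // new branch taken
	fmt.Println(getVersion(newRun, "Step1", 2)) // stable across replays
}
```

Because the decision is recorded on first execution and replayed afterwards, the branch taken never changes for a given run - which is exactly the determinism guarantee the real API preserves.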

Testing Failure Paths (24:38)#

Most people (including us) primarily code for happy paths when thinking about business logic. However, there are some unhappy paths that are persistently overlooked:

    Backlogs: A backlog happens when multiple Workflow tasks (mainly Timers) resolve together when some part of your system (say your Workers, or even the Server) is down. When Temporal recovers, it will still fire those timers, but it is not guaranteed to fire them in chronological order. However, happy-path code will typically assume that they do. Always test your system for backlogs.
      Specifically, when you write end-to-end tests for your production Temporal system, make sure to write some that shut down your Workers for some time to create a backlog of Workflow Tasks, and then spin them up again to churn through the backlog. You can do this programmatically within the SDK Worker API or externally through your infra automation tool of choice.
    Capacity Planning: End-to-end tests that include Worker outages do more than surface edge cases like backlogs. When the system has a backlog, its resource consumption is very different from normal operation. Always do capacity planning that accounts for heavy load due to a backlog or spikes.
    Timeouts: Temporal provides 4 kinds of Activity timeouts (we don't recommend using Workflow timeouts) and a declarative Retry Policy. These are very flexible, which is powerful but can be a source of mistakes. We recommend explicitly unit- and integration-testing these timeouts - don't assume that the code you have written will behave as you intended.

Did this discussion help with your production concerns? What other questions do you wish to ask? Email community@temporal.io to request the next topic!
