Temporal Blog — September 30, 19:16
生产环境中部署Temporal Worker的运维指南

 

This article covers the key operational considerations for deploying Temporal Workers in production, focusing on four core areas: operational metrics monitoring and alerting, incident response and tooling, upgrading and versioning, and testing failure paths. It stresses that in a distributed system, visibility into Worker health is essential, and describes how Temporal provides that insight through rich metrics. It also offers strategies for identifying and handling Workflow Task failures, explains in depth how to use the `getVersion` API to safely migrate long-running Workflows, and shows how simulating Worker outages supports testing and capacity planning. Finally, it recommends thoroughly testing timeout and retry policies to prepare for inevitable failures.

📊 **Operational Metrics Monitoring and Alerting**: In production, knowing how your Temporal Workers are running is essential. Temporal emits rich client and server metrics, including counters, latencies, and server and namespace health. Via the Tally library, these metrics integrate with tools such as m3, Prometheus, Statsd, and Datadog, and dashboards are provided for Grafana. Key items to watch include Activity Worker and Workflow Worker start, completion, and failure rates, along with Task latency; these metrics are the foundation for judging application health and setting alerts.

🚨 **Incident Response and Troubleshooting**: When the Workflow Task failure rate spikes, you are in incident response mode. The first step is to determine whether the failure originates in your application or in Temporal Server itself, which you can judge from the "service request" and "service error" metrics. If the failure is on the application side, logs are the key investigative tool; make sure they carry structured data such as workflow type, activity type, workflow ID, and run ID. Execution history and Replay can then be used to dig into why a specific workflow failed.

🔄 **Upgrading and Versioning**: Migrating long-running Workflows that are still in flight is a challenge. Temporal's `getVersion` API (or the Patching API in TypeScript) lets you change code while preserving Workflow determinism. The API internally tracks the code version at the time a Workflow started and uses user-defined branches to navigate code paths. Before migrating, sample Workflow histories at different stages and run replay tests. For short-lived Workflows, you can also use separate Task Queues to isolate deployments of old and new code.

⚠️ **Testing Failure Paths and Capacity Planning**: Developers often neglect scenarios off the "happy path," such as backlogs (multiple Workflow Tasks resolving at once after Workers or the Server were briefly offline) and timeouts. Test for backlogs by simulating Worker downtime, and do capacity planning that accounts for the resources consumed while working through a backlog or a traffic spike. Temporal's various Activity timeouts and retry policies should also be verified with unit and integration tests to ensure they behave as intended.

Update: see also our new docs on SDK metrics and Temporal Worker Tuning for detailed instructions.

Whether you are self-hosting Temporal Server or using Temporal Cloud, a key part of Temporal's operational model is running your own Workers. We recently hosted a chat with Temporal CTO Samar Abbas on the 4 main areas to note when productionizing your Temporal Workers:

    1. Operational Metrics - Monitoring and Alerting
    2. Incident Response & Tooling
    3. Upgrading and Versioning
    4. Testing Failure Paths

Watch the full presentation here.

Operational Health Metrics - Monitoring and Alerting (02:30)#

The great thing about Temporal is that you can develop against it locally without constraints (and it is getting even faster with temporalite). But when it comes to running your Temporal application in production, you are now dealing with a distributed system.

In other words: How will you know if your Temporal application is doing its job or is even running as expected?

Temporal emits a lot of metrics, both client and server, which gives insights into how your application is performing: counters, latency, and metrics for server health and for each namespace. Temporal uses the Tally library from Uber to emit properly buffered metrics with nice reporting integration with m3, Prometheus, Statsd, and Datadog via custom reporters, and provides dashboards for Grafana users to display them. We also recently co-presented about Temporal and M3 with Chronosphere at KubeCon NA 2021.

    Activity Worker Health
      - How many activities are getting started?
      - How many activities are completing successfully?
      - Which activities are failing?
      - Set alerts for Activity failure rate
    Workflow Worker Health
      - What rate are your Workflow Tasks happening at?
      - Are your Workflow Tasks completing successfully?
      - What is your Workflow Task latency?
      - Set alerts for Workflow Task failure rate
    Scaling

Our full reference on SDK/Worker metrics is here.
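As a concrete (and deliberately simplified) illustration of alerting on these numbers, the sketch below computes a failure rate from two counters and applies a threshold. The function names and the 5% threshold are our own invention, not part of the SDK; in practice you would express this as a query over the SDK metrics in your monitoring system, e.g. as a Prometheus alert rule.

```go
package main

import "fmt"

// failureRate returns the fraction of failed executions among all
// finished executions in a monitoring window.
func failureRate(failed, total float64) float64 {
	if total == 0 {
		return 0
	}
	return failed / total
}

// shouldAlert applies a simple threshold, e.g. page when more than 5%
// of Workflow Tasks in the window failed.
func shouldAlert(failed, total, threshold float64) bool {
	return failureRate(failed, total) > threshold
}

func main() {
	fmt.Println(shouldAlert(2, 100, 0.05))  // 2% failures: below threshold
	fmt.Println(shouldAlert(12, 100, 0.05)) // 12% failures: alert
}
```

The same threshold logic applies to Activity failure rates; only the counters you feed in change.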

Once you have ingested all these metrics and set up a dashboard to give you visibility, you can start to get an overall picture of what's going on in the system.

Incident Response & Tooling#

When Workflow Task failure rates spike, you are now in incident response mode. There can be many possible causes, so the first thing an operator needs to do is figure out whether the failure is coming from your application or from Temporal Server.

    - Metrics. The first metrics to look at are the "service request" and "service error" metrics emitted by the Server. If you are seeing Workflow Tasks or Activities failing, that is a strong indication that you should focus on the application side rather than Temporal Server.
    - Logs. The best tool for investigating application-side failures is your logs - make sure logs are captured and stored by your Workers, and tagged with important structured data like workflow type, activity type, workflow ID, and run ID. Capture as much information as possible in tags so you can slice and dice logs on arbitrary criteria.
    - Execution History. You should be able to tell whether failures are coming from one particular workflow type, and to isolate a specific workflow/run ID to investigate. You can then use Temporal Web or the Temporal CLI (tctl) to look at the state and execution history of that workflow, which will reveal more datapoints for debugging. The history view also shows the entire call stack of the failure.
    - Replay. You can also replay the Workflow to investigate it: download the execution history via the Web UI or CLI, and step through the code with a debugger on your local machine.
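To make the "tagged with structured data" point concrete, here is a minimal sketch using Go's standard log/slog package. The field names (`workflow_type`, `workflow_id`, etc.) and the example values are our own convention for illustration, not an SDK standard; the Temporal SDKs also ship their own context-aware loggers.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"strings"
)

// taggedLog writes one log line carrying the structured fields that make
// Worker logs sliceable: workflow type, activity type, workflow ID, and
// run ID. The values here are placeholders.
func taggedLog(msg string) string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewTextHandler(&buf, nil)).With(
		"workflow_type", "OrderWorkflow",
		"activity_type", "ChargeCard",
		"workflow_id", "order-1234",
		"run_id", "example-run-id",
	)
	logger.Error(msg)
	return buf.String()
}

// hasTag reports whether the rendered log line contains a key=value tag.
func hasTag(line, tag string) bool { return strings.Contains(line, tag) }

func main() {
	line := taggedLog("payment gateway returned 503")
	fmt.Println(hasTag(line, "workflow_id=order-1234"))
	fmt.Println(hasTag(line, "activity_type=ChargeCard"))
}
```

With fields like these attached once at logger construction, every subsequent line from that Activity can be filtered by workflow or run ID in your log aggregator.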

Upgrading and Versioning (17:59)#

How do you migrate long running workflows that are still in flight? Most developers who hand roll their own job scheduling and orchestration frameworks don’t have anything better than “deploy and pray”, but Temporal offers a first class solution.

Temporal requires that your Workflows must be fully deterministic, and changing Workflow code means potentially breaking determinism. This is where the getVersion API is handy because it allows you to make changes in Workflow logic in a deterministic fashion.

Temporal offers a getVersion API in the Go, Java, and PHP SDKs, and a Patching API in TypeScript, which essentially amounts to feature flagging by start-time cohort:

v := workflow.GetVersion(ctx, "Step1", 1, 2)
if v == 1 {
	err = workflow.ExecuteActivity(ctx, ActivityC, data).Get(ctx, &result1)
} else {
	err = workflow.ExecuteActivity(ctx, ActivityD, data).Get(ctx, &result1)
}

We have also published a 30 min tutorial on versioning to guide you through this API.

    - How it works: Temporal internally tracks the version of the Workflow code at the time that particular Workflow Execution was started, and uses that to navigate code paths based on these user-defined branches. New Workflow Executions take the latest paths by default.
    - Sample and Replay: While this is an elegant and powerful API (because it can be used and composed in arbitrary logic), it still needs to be handled carefully. This is why replaying workflow executions is so important. Before migration, take samples of Workflow histories at different stages, and use the replay API.
    - Use Task Queues for Short-Lived Workflows: Think through your upgrade strategy - you may not even need versioning if your workflows are short-lived enough! If you just want to cut over to new code, you can run old code on the same Task Queue with a subset of Workers, and run new code on a different set of Workers with a new Task Queue.
    - Worst Case: When in doubt, take a breath - because of Temporal's event-sourced nature, if a migration is screwed up, the worst case is that your Workflows don't make progress. Latency will spike, but data will not be lost - just make sure you are fully familiar with metrics and logs so you can investigate issues.
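The "How it works" bullet can be sketched in a few lines. This is a self-contained toy model of the idea behind GetVersion, not the SDK implementation: the first time an execution hits a change ID, the chosen version is recorded in its history; on replay, the recorded value is returned, so executions started before the change keep taking their old branch deterministically.

```go
package main

import "fmt"

// history stands in for the per-execution event history where version
// markers are recorded (a toy model, not the real data structure).
type history map[string]int

func getVersion(h history, changeID string, maxSupported int) int {
	if v, ok := h[changeID]; ok {
		return v // replaying: honor the version recorded at first execution
	}
	h[changeID] = maxSupported // new execution: record the latest version
	return maxSupported
}

func main() {
	oldRun := history{"Step1": 1} // started before the code change
	newRun := history{}           // started after the code change

	fmt.Println(getVersion(oldRun, "Step1", 2)) // old branch preserved
	fmt.Println(getVersion(newRun, "Step1", 2)) // new branch taken
	fmt.Println(getVersion(newRun, "Step1", 2)) // stable across replays
}
```

Because the decision is recorded on first execution and replayed afterwards, the branch taken never changes for a given run - which is exactly the determinism guarantee the real API preserves.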

Testing Failure Paths (24:38)#

Most people (including us) primarily code for happy paths when thinking about business logic. However, there are some unhappy paths that are persistently overlooked:

    Backlogs: A backlog happens when multiple Workflow tasks (mainly Timers) resolve together when some part of your system (say your Workers, or even the Server) is down. When Temporal recovers, it will still fire those timers, but it is not guaranteed to fire them in chronological order. However, happy-path code will typically assume that they do. Always test your system for backlogs.
      Specifically, when you write end-to-end tests for your production Temporal system, make sure to write some that shut down your Workers for some time to create a backlog of Workflow Tasks, and then spin them up again to churn through the backlog. You can do this programmatically within the SDK Worker API or externally through your infra automation tool of choice.
    Capacity Planning: End-to-end tests that include Worker outages do more than surface edge cases like backlogs. When the system has a backlog, its resource consumption is very different from normal operation. Always do capacity planning that accounts for heavy load due to a backlog or spikes.
    Timeouts: Temporal provides 4 kinds of Activity timeouts (we don't recommend using Workflow timeouts) and a declarative Retry Policy. These are very flexible, which is powerful but can be a source of mistakes. We recommend explicitly unit- and integration-testing these timeouts - don't assume that the code you have written will behave as you intended.

Did this discussion help with your production concerns? What other questions do you wish to ask? Email community@temporal.io to request the next topic!
