Temporal Blog, September 30, 19:17
Temporal: Rethinking "Long Running" Work

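The article examines what "long running" work really means, arguing that it is not limited to jobs that take hours but spans everything from a few seconds to several days or longer. Box uses Temporal to orchestrate file updates; most of these operations feel instantaneous to users, and Box values Temporal mainly for its transactional and reliability guarantees. Checkr uses Temporal to coordinate background checks that can take days. By persisting event histories and providing the ContinueAsNew API, Temporal supports Workflows that run indefinitely and addresses uncertainty in distributed systems, separating business logic from the complexity of the system itself.

💡 "Long running" work is broader than the conventional definition, covering everything from a few seconds to hours or even days. Box uses Temporal for seemingly instantaneous operations such as file updates, valuing its transactional and reliability guarantees for microservice orchestration. This shows that Temporal applies well beyond the traditional meaning of "long running".

⏳ Temporal can handle Workflows that run indefinitely. It persists event histories as the source of truth, and when a Workflow approaches the event count limit, the ContinueAsNew API lets it snapshot its state and start a new Workflow Execution. This makes it possible to model all activity across a user's lifetime, such as loyalty rewards, subscription charges, and reminder emails.

⚖️ Temporal's core value lies in addressing "uncertainty" in distributed systems. When you cross network boundaries, need to scale horizontally, or coordinate migrations, Temporal separates these mechanical uncertainties from business logic. It provides a reliable event history and strong orchestration capabilities to help developers cope with the complexity of distributed systems and reduce failures caused by uncertainty.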

In most discussions of Temporal use cases, we go straight to talking about "long running" work. In the past, I typically associated that sort of work with "use if I need to do video processing". But I was wrong.

"Long running" use cases are both shorter and longer than you think.

Context

If you surveyed most developers for what a "long running job" typically means to them, they would usually point to a range of intervals from 10 minutes to 6 hours, based on their background.

A common shorthand we've used in the past is "anything beyond request/response". You can use AWS Lambda's default timeout of 3 seconds as a boundary between "request/response" and "long running". But even though you can extend that timeout to 15 minutes, would you really want to?

Perhaps "long running" isn't really about some arbitrary cutoff in time.

"Long Running" can be really short

Box uses Temporal for orchestrating file update operations. Although these can take hours for large transfers, the vast majority feel instantaneous to users. Ideally, one solution should scale from the smallest to the largest use cases with no more visible latency than absolutely necessary. Box uses Temporal mainly for its transactional and reliability guarantees around microservice orchestration, and the words "Long Running" were never even mentioned.

Workflow engines that rely on a centralized scheduler have a hard-to-predict floor on their latency and are a poor fit for cases where you want the work done as soon as possible rather than once a day. The work could be long running, but mostly it's not, and the programming model needs to scale with your users rather than force you to swap out systems at some arbitrary threshold.

Under the hood, Temporal accomplishes fast responses with long polling. Workers poll Task Queues inside Temporal Server, which holds each connection open until the next Task is enqueued. While Temporal is not suited for truly realtime needs like gaming, this lets Temporal Workers respond quickly to state changes and is much less resource intensive than using web sockets.
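To make the long-polling idea concrete, here is a minimal sketch using only Python's standard library. It is not Temporal's implementation: an in-process `queue.Queue` stands in for a Task Queue, and the names `long_poll`, `worker`, and `run_demo` are made up for illustration. The point is only that a parked poll returns as soon as work is enqueued, instead of waiting for the next tick of a fixed polling interval.

```python
import queue
import threading
import time


def long_poll(task_queue, timeout_s):
    """Block until a task is available, or return None when the poll times out."""
    try:
        return task_queue.get(timeout=timeout_s)
    except queue.Empty:
        return None


def worker(task_queue, results, stop):
    # Each loop iteration is one long poll. The call parks until work arrives,
    # so an idle worker does not spin, and a new task is picked up right away.
    while not stop.is_set():
        task = long_poll(task_queue, timeout_s=0.5)
        if task is None:
            continue  # poll timed out with no work; issue the next poll
        enqueued_at, payload = task
        results.append((payload, time.monotonic() - enqueued_at))


def run_demo():
    task_queue = queue.Queue()  # toy stand-in for a Task Queue
    results = []
    stop = threading.Event()
    thread = threading.Thread(target=worker, args=(task_queue, results, stop))
    thread.start()
    time.sleep(0.2)  # the worker is already parked inside a poll
    task_queue.put((time.monotonic(), "update-file-42"))  # hypothetical payload
    deadline = time.monotonic() + 2
    while not results and time.monotonic() < deadline:
        time.sleep(0.01)
    stop.set()
    thread.join()
    return results
```

Calling `run_demo()` returns a single `(payload, latency_seconds)` pair. The latency reflects how quickly the parked poll woke up, not the 0.5 second poll timeout.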

Note: this is also why we strongly recommend monitoring ScheduleToStart latency for production deployments, as a spike in this latency metric is your first indication of a spike in traffic or problem with your Workers or networking.
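As a rough illustration of what that metric measures, the sketch below computes schedule-to-start latency from pairs of timestamps and flags a backlog when a high percentile crosses a threshold. The function names, the 95th percentile, and the 1 second threshold are all assumptions chosen for the example; in a real deployment you would read this latency from the metrics your Workers emit rather than compute it by hand.

```python
import math


def schedule_to_start_latencies(tasks):
    """tasks: iterable of (scheduled_at, started_at) timestamps in seconds."""
    return [started_at - scheduled_at for scheduled_at, started_at in tasks]


def percentile(values, pct):
    """Nearest-rank percentile, with pct in the range (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def is_backlogged(tasks, threshold_s=1.0, pct=95):
    """True when tasks sit in the queue longer than the (assumed) threshold."""
    return percentile(schedule_to_start_latencies(tasks), pct) > threshold_s


# Tasks picked up almost immediately: Workers are keeping up.
healthy = [(0.0, 0.05), (1.0, 1.02), (2.0, 2.04)]

# Tasks waiting seconds before a Worker starts them: traffic spike or Worker trouble.
backlogged = [(0.0, 0.05), (1.0, 4.0), (2.0, 7.5)]
```

Here `is_backlogged(healthy)` is `False` and `is_backlogged(backlogged)` is `True`, because the slowest tasks in the second list waited several seconds before being picked up.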

"Long Running" can be infinitely long

Checkr uses Temporal for coordinating background checks. This is a multi-stage process with a huge range in processing times, from pinging a database search API, to dispatching a court researcher to a courthouse, followed by analyzing each record and potentially escalating to manual QA. The whole process could take days.

The problem with hand-rolled orchestration systems built out of queues and databases is that the longer a job runs and the more moving parts it has, the more likely you are to discover which of the fallacies of distributed computing you have forgotten. That might mean failing to persist some state, failing to instrument for observability and debuggability, or rolling out changes to production infrastructure while work is mid-flight.

Temporal solves this by persisting event histories as a source of truth, solving for both observability and reliability in one fell swoop. While there is a soft limit of 10,000 events in a single Workflow Execution, you can pick a natural cutoff point to snapshot state and start a new Workflow Execution with our ContinueAsNew API.
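The snapshot-and-restart idea can be modeled in a few lines. The sketch below is a toy, not the Temporal SDK: `Execution`, `record`, and `HISTORY_LIMIT` are hypothetical names, the limit is set to 5 so the rollover is easy to see, and the "state" is just a running total. It shows state being rebuilt by replaying an append-only history, and a fresh execution taking over with a snapshot once the history fills up.

```python
from dataclasses import dataclass, field

HISTORY_LIMIT = 5  # toy stand-in for the real event-count limit


@dataclass
class Execution:
    run_id: int
    snapshot: int  # state carried in from the previous run
    history: list = field(default_factory=list)  # append-only event log

    def state(self):
        """Rebuild current state by replaying the history on top of the snapshot."""
        total = self.snapshot
        for event in self.history:
            total += event
        return total


def record(execution, event):
    """Append an event; roll over to a fresh execution when the history is full."""
    execution.history.append(event)
    if len(execution.history) >= HISTORY_LIMIT:
        # Continue-as-new in miniature: same logical workflow, new run,
        # empty history, with the current state passed along as a snapshot.
        return Execution(run_id=execution.run_id + 1, snapshot=execution.state())
    return execution


def run_demo():
    execution = Execution(run_id=1, snapshot=0)
    for amount in range(1, 13):  # twelve events with values 1 through 12
        execution = record(execution, amount)
    return execution
```

After `run_demo()`, the workflow is on its third run with only two events in its current history, yet `state()` still returns 78, the sum of all twelve events. No single history ever grows past the limit, which is what lets the logical workflow keep running indefinitely.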

In practice, this means you can write infinitely long running Workflows. Since Temporal makes state easy to store and fully auditable, this has the potential to change your programming model completely, by allowing you to model every user as a long lived Entity in your system.

For example, you could use this for e-commerce: coordinating actions like loyalty rewards, subscription charges, and reminder emails over the entire history of each customer's relationship with you.

It's really about Uncertainty

The notion that "Temporal is for long running work" is more subtle than it appears at first glance. If your microservices and third party APIs were always reliable and always responded quickly, you wouldn't need Temporal. If you didn't care about modeling and testing full journeys with code, from a multi-stage background check to the entire history of a customer from signup to churn, then you could keep piecing them together on an ad-hoc basis.

But it's when you're unsure about parts of your distributed system that you need Temporal. When you cross network boundaries, when you need to scale horizontally, or when you need to coordinate migrations, you start needing a solution that separates these mechanical uncertainties from the business logic.
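One small way to picture that separation is a retry policy that lives outside the business logic. The sketch below is an in-memory illustration only: `with_retries`, `TransientError`, `make_flaky_charge`, and `checkout` are hypothetical names, and unlike Temporal nothing here is persisted, so a process crash would lose the attempt count. It is meant to show the shape of the idea, with the business step reading as a plain call while the handling of unreliable dependencies sits elsewhere.

```python
import time


class TransientError(Exception):
    """Stand-in for a timeout or dropped connection."""


def with_retries(fn, max_attempts=5, initial_delay_s=0.01, backoff=2.0,
                 sleep=time.sleep):
    """Call fn until it succeeds, backing off exponentially between attempts."""
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(delay)
            delay *= backoff


def make_flaky_charge(failures_before_success):
    """Hypothetical third-party call that fails a fixed number of times first."""
    calls = {"count": 0}

    def charge():
        calls["count"] += 1
        if calls["count"] <= failures_before_success:
            raise TransientError("upstream timed out")
        return "charged"

    return charge, calls


def checkout(charge):
    # Business logic stays a plain sequence of steps; retry mechanics live elsewhere.
    return with_retries(charge)
```

With `charge, calls = make_flaky_charge(2)`, calling `checkout(charge)` returns `"charged"` on the third attempt. The `checkout` function never mentions timeouts or backoff, so the policy can change without touching it.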

Most engineers have learned to deal with uncertainty piecemeal, painful incident after painful incident. We've assembled centuries of distributed systems experience to solve it for you.
