Temporal Blog, September 30, 19:12
Failure Management Strategies in Software

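In software, real-world systems need more than code that works on the “happy path”; they also need a safety net for unexpected failures. Modern software development involves connecting systems across time and space, handling data that never sleeps, and delivering flawless performance at scale. What sets top developers apart from average ones is how they manage failure. Building resilience is about ensuring reliability when things go wrong, not just maintaining uptime. This article examines three common approaches to handling failure in software (remote procedure calls (RPC), persistent queues, and workflows) and introduces Temporal’s “workflow-as-code” approach, which helps developers build reliability in from the start of a project.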

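💡 The remote procedure call (RPC) model is simple to implement thanks to its direct client-server connection and suits simple, synchronous tasks. Its failure management is limited, though: the client has to carry most of the error recovery and retry logic, which grows more complex as the system scales.

📈 Persistent queues add flexibility by decoupling client and server. They support automatic retries and asynchronous processing and can smooth out workloads, but they suffer from out-of-order messages, dead-letter queues that need manual intervention, and poor visibility into status, which makes comprehensive failure management hard.

🔄 Workflows offer robust orchestration for complex distributed systems. They handle retries, state, and error recovery automatically, support long-running processes, and improve visibility, but they need dedicated infrastructure and are complex to set up. They suit complex tasks that demand high reliability and transparency.

⚙️ The Temporal platform’s “workflow-as-code” approach defines steps in ordinary code and manages retries, state, and errors automatically, with no extra definition files. It helps developers build reliability in from the start of a project and has been adopted by large enterprises such as ANZ to make financial processes more resilient.

🎯 Effective failure management is a strategy, not a setback. Paying attention to failure handling from the start of development is what distinguishes excellent teams, and building resilience into the system lays the foundation for a project’s long-term success.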

There’s a saying: “Amateurs study tactics, while professionals study logistics.” In software, this translates to: “Amateurs focus on algorithms, while professionals focus on failures.”

At J on the Beach, I took time in my talk to expand on this saying and explain that real-world systems don’t just need code that works on the “happy path” — they need a safety net for when things go wrong.

Modern software development has layers of complexity. You’re not just writing code; you’re connecting systems across time and space, handling data that doesn’t sleep, and ensuring flawless performance at scale. What sets top developers apart is how they manage failures. Building resilience focuses on ensuring reliability when things inevitably go wrong, not just maintaining uptime.

In this post, we’ll walk through three common approaches to handling failures in software, each with its own strengths and weaknesses. Then we’ll introduce Temporal’s approach, workflow-as-code, which makes it easier to build reliability into your systems from day one.

Three Ways to Handle Failure in Your Software

Failures are inevitable in your distributed systems. When a network link fails, a server times out, or a service crashes, systems need strategies to respond properly and ensure that your operations remain reliable.

Below, we’ll explore three common approaches to coordination between systems — Remote Procedure Calls (RPCs), persistent queues, and workflows — and their relationship to failure management.

1. Request-Response (RPC)

The request-response, or RPC model, is a classic approach. A client makes a request, the server processes it, and sends back a response. In the best-case scenario — the “happy path” — everything works smoothly. Imagine a money transfer request: one service debits the sender while another credits the receiver. If all goes as planned, the transfer completes with no issues.

Pros of the RPC Model

    Simplicity: The direct client-server connection makes this model easy to implement for straightforward workflows.
    Efficiency on the “happy path”: When things go smoothly, RPC provides fast, efficient responses and low latency.

Cons of the RPC Model

    Limited resilience for partial failures: If the client’s request is successful, but a response isn’t received, or a step in the process fails, RPC often requires extensive error-handling code on the client side.
    Heavy client burden: Clients must handle errors, recovery, and retries, complicating systems as they scale.

The RPC model works well for simple, synchronous tasks. However, for resilience, it falls short by placing the onus on developers of the RPCs and those consuming them to manage every failure scenario — and this is no trivial matter.
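To make the client burden concrete, here is a minimal sketch of the retry logic an RPC caller might end up writing for the debit half of the money transfer. Everything in it is hypothetical and chosen for illustration: `FakeLedgerServer`, `TransientError`, `call_with_retries`, the idempotency key, and the backoff values are assumptions, not part of any real RPC framework. It simulates the partial failure described above, where the request succeeds on the server but the response is lost.

```python
import time
import uuid


class TransientError(Exception):
    """Stands in for a timeout or a dropped connection."""


class FakeLedgerServer:
    """Hypothetical in-memory server that remembers which requests it applied."""

    def __init__(self, responses_to_drop=0):
        self.balances = {"alice": 100, "bob": 0}
        self.applied = set()
        self._responses_to_drop = responses_to_drop

    def debit(self, account, amount, idempotency_key):
        # Apply the debit at most once per key, even if the client retries.
        if idempotency_key not in self.applied:
            self.balances[account] -= amount
            self.applied.add(idempotency_key)
        # Simulate "the request succeeded but the response was lost".
        if self._responses_to_drop > 0:
            self._responses_to_drop -= 1
            raise TransientError("response lost")
        return self.balances[account]


def call_with_retries(rpc, *args, attempts=4, base_delay=0.01, **kwargs):
    """Retry a call with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return rpc(*args, **kwargs)
        except TransientError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


server = FakeLedgerServer(responses_to_drop=2)
key = str(uuid.uuid4())  # one key for the whole logical request, reused on retry
balance = call_with_retries(server.debit, "alice", 30, idempotency_key=key)
print(balance)  # alice was debited exactly once despite three calls
```

Without the idempotency key, each retry would debit the sender again. That is the kind of detail the RPC model leaves to every client and server author to get right.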

2. Persistent Queues

Persistent queues add a degree of flexibility by decoupling the client from the server. Messages are placed in a queue, and the system processes them asynchronously. Queues help distribute workloads: they support automatic retries and asynchronous processing, which can smooth out demand spikes.

Pros of Persistent Queues

    Automatic retries: Persistent queues often support automatic retries, attempting tasks multiple times if they initially fail.
    Load distribution: Queues smooth processing under heavy loads by distributing requests over time, which improves system reliability.
    Producer-consumer separation: Decoupling producers and consumers allows each side to function independently, improving fault tolerance.

Cons of Persistent Queues

    Loss of ordering: Since queues process messages independently, tasks may execute out of order, causing unexpected issues for dependent operations.
    Dead-letter queues: Tasks that continuously fail may require a separate “dead-letter” queue, adding complexity and, typically, manual intervention.
    Limited visibility into status: Visibility becomes even more challenging when you have systems that use multiple queues, requiring additional tooling and infrastructure.

Queues work well when you need flexibility and decoupling, but they lack the control and visibility needed for comprehensive failure management.
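The retry, ordering, and dead-letter behavior described above can be sketched with an in-memory queue. This is a toy model under stated assumptions, not a real message broker: `process_queue`, the retry budget of three attempts, and the message names are all hypothetical.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget per message


def process_queue(messages, handler, max_attempts=MAX_ATTEMPTS):
    """Drain a queue, retrying failures and dead-lettering repeat offenders."""
    queue = deque((msg, 1) for msg in messages)  # (message, attempt number)
    done, dead_letters = [], []
    while queue:
        msg, attempt = queue.popleft()
        try:
            handler(msg)
            done.append(msg)
        except Exception:
            if attempt < max_attempts:
                # The retry goes to the back of the queue, behind newer work.
                queue.append((msg, attempt + 1))
            else:
                dead_letters.append(msg)  # someone has to look at this by hand
    return done, dead_letters


failed_once = set()


def handler(msg):
    if msg == "poison":
        raise ValueError("always fails")
    if msg == "debit" and msg not in failed_once:
        failed_once.add(msg)
        raise ConnectionError("fails on the first attempt only")


done, dead = process_queue(["debit", "credit", "poison"], handler)
print(done)  # ['credit', 'debit']
print(dead)  # ['poison']
```

Note that `credit` completes before `debit`, because the retried `debit` was re-queued behind it. That is the loss of ordering listed among the cons, and the `poison` message shows why a dead-letter queue usually ends in manual intervention.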

3. Workflows

Workflows provide a robust solution for orchestrating complex processes across distributed systems. Unlike RPC or queue-based models, workflows manage retries, state, and error handling automatically, making them ideal for long-running or multi-step processes.

Pros of Workflows

    Built-in resilience: Workflows handle retries, recovery, and compensation steps automatically, reducing the need for custom error-handling code.
    Support for long-running processes: Workflows accommodate processes that span minutes, hours, or even days, making them well-suited for complex tasks.
    Enhanced visibility: Workflow systems enable real-time tracking and querying, so both clients and developers can see exactly where each process stands.

Cons of Workflows

    Infrastructure requirements: Workflows require a solid infrastructure to manage states, retries, and tracking, which some teams may lack.
    Setup complexity: Workflow systems can be complex to set up, especially when building custom solutions to manage workflows.

For complex processes that demand reliability and transparency, workflows provide the most comprehensive solution, though they require dedicated infrastructure to deploy effectively.
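As a rough illustration of what “retries and compensation steps” mean in practice, the sketch below runs the money transfer as a list of steps, retries each one, and undoes completed steps if a later one keeps failing. It is a toy in-process orchestrator written for this example. `run_with_compensation` and the step names are hypothetical, and a real workflow system would also persist state between steps.

```python
def run_with_compensation(steps, max_attempts=3):
    """Run (name, action, compensation) steps in order.

    Each action is retried up to max_attempts times. If one still fails,
    the compensations of the steps that already completed run in reverse.
    Returns (status, log), where log records everything that happened.
    """
    completed, log = [], []
    for name, action, compensate in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                action()
                log.append(f"{name}:ok")
                completed.append((name, compensate))
                break
            except Exception:
                log.append(f"{name}:failed#{attempt}")
        else:
            # Retries exhausted: undo earlier steps, newest first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"{done_name}:compensated")
            return "rolled_back", log
    return "completed", log


balances = {"sender": 100, "receiver": 0}


def debit():
    balances["sender"] -= 40


def refund():
    balances["sender"] += 40


def credit():
    raise ConnectionError("receiver's bank is unreachable")


status, log = run_with_compensation([
    ("debit", debit, refund),
    ("credit", credit, lambda: None),
])
print(status)    # rolled_back
print(balances)  # {'sender': 100, 'receiver': 0}
```

The `log` list doubles as a crude form of the visibility mentioned above: it shows which step each run reached and why it stopped.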

At Temporal, we addressed these challenges by designing a platform that handles resilience, error handling, and state management so you don’t have to.

With Temporal, you write workflows as code, with no extra XML, JSON, or YAML definitions of workflow logic that are difficult to understand and debug down the line. Define your steps in regular code, and Temporal does the rest: it manages retries, maintains state, and keeps your workflows reliable and simple to create.
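The idea of “regular code whose state is maintained for you” can be illustrated with a toy journal that records each finished step, so that a restarted process skips work it has already done. This is only a sketch of the general idea in plain Python. It does not use the Temporal SDK and is not how Temporal is implemented; the `Journal` class, the `transfer` function, and the JSON file are assumptions made for this example.

```python
import json
import os
import tempfile


class Journal:
    """Toy durable step log (hypothetical; not Temporal's API)."""

    def __init__(self, path):
        self.path = path
        self.results = {}
        if os.path.exists(path):
            with open(path) as f:
                self.results = json.load(f)

    def step(self, name, fn):
        # A step that already finished is read back from the journal, not re-run.
        if name in self.results:
            return self.results[name]
        result = fn()
        self.results[name] = result
        with open(self.path, "w") as f:
            json.dump(self.results, f)
        return result


calls = []  # records which side effects actually ran


def transfer(journal, crash_after_debit=False):
    # The "workflow" is ordinary sequential code.
    journal.step("debit", lambda: calls.append("debit") or "debited")
    if crash_after_debit:
        raise RuntimeError("process crashed")
    journal.step("credit", lambda: calls.append("credit") or "credited")
    return "done"


path = os.path.join(tempfile.mkdtemp(), "transfer.json")
try:
    transfer(Journal(path), crash_after_debit=True)
except RuntimeError:
    pass  # simulate the process dying between the two steps

outcome = transfer(Journal(path))  # restart: the debit is not repeated
print(outcome, calls)  # done ['debit', 'credit']
```

The function body reads like a plain script, yet the sender is debited only once across the crash and restart. A production platform adds what this sketch leaves out, such as retries, timeouts, and a durable store shared across machines.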

Companies like ANZ Bank, one of the largest banks in the Asia-Pacific region, rely on Temporal to strengthen the resilience and reliability of critical financial processes. With Temporal, ANZ orchestrates and manages complex operations across distributed systems, ensuring tasks are retried automatically, failures are handled, and long-running processes are tracked seamlessly. This has enabled ANZ to boost system reliability, reduce operational complexity, and uphold strict compliance standards in their high-stakes FinServ environment.

Failure Management Is a Strategy, Not a Setback

Any complex system will encounter failures, but how you handle those failures makes all the difference. For developers, focusing on failure management from the start distinguishes exceptional teams from average ones. Building resilience into your system sets your project up for long-term success.

Discover how to make your app resilient with examples from leading companies like Snap and Coinbase, or start a free trial of Temporal Cloud today.
