Failure Management Strategies in Software

There’s a saying: “Amateurs study tactics, while professionals study logistics.” In software, this translates to: “Amateurs focus on algorithms, while professionals focus on failures.”

At J on the Beach, I took time in my talk to expand on this saying and explain that real-world systems don’t just need code that works on the “happy path” — they need a safety net for when things go wrong.

Modern software development has layers of complexity. You’re not just writing code; you’re connecting systems across time and space, handling data that doesn’t sleep, and ensuring flawless performance at scale. What sets top developers apart is how they manage failures. Building resilience means ensuring reliability when things inevitably go wrong, not just maintaining uptime.

In this post, we’ll walk through three common approaches to handling failures in software, each with its own strengths and weaknesses. Then we’ll introduce Temporal’s approach, workflow-as-code, which makes it easier to build reliability into your systems from day one.

Three Ways to Handle Failure in Your Software

Failures are inevitable in distributed systems. When a network link fails, a server times out, or a service crashes, your systems need strategies to respond properly so that operations remain reliable.

Below, we’ll explore three common approaches to coordination between systems — Remote Procedure Calls (RPCs), persistent queues, and workflows — and their relationship to failure management.

1. Request-Response (RPC)

The request-response model, or RPC, is a classic approach. A client makes a request, and the server processes it and sends back a response. In the best-case scenario, the “happy path,” everything works smoothly. Imagine a money transfer request: one service debits the sender while another credits the receiver. If all goes as planned, the transfer completes with no issues. If the debit succeeds and the credit fails, though, the money has left one account without arriving in the other, and something has to notice and repair that.

Pros of the RPC Model

Simplicity

Efficiency on the “happy path”

Cons of the RPC Model

Limited resilience for partial failures

Heavy client burden

The RPC model works well for simple, synchronous tasks. For resilience, however, it falls short: it places the onus on both the developers of the RPCs and the developers consuming them to manage every failure scenario, and this is no trivial matter.
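To make the client burden concrete, here is a minimal sketch of the money transfer handled in the RPC style. Everything in it is an assumption made for illustration: `Ledger`, `TransientError`, `transfer`, and the `refund` compensation step are hypothetical names, and the in-memory ledger stands in for two remote services. The point is only how much failure logic (retries, idempotency keys, compensation) ends up in the caller.

```python
import uuid


class TransientError(Exception):
    """Stands in for a timeout or dropped connection."""


class Ledger:
    """Hypothetical in-memory account service (a real one sits behind a network call)."""

    def __init__(self, balances, credit_failures=0):
        self.balances = dict(balances)
        self.applied = set()                  # idempotency keys already processed
        self._credit_failures = credit_failures

    def _apply(self, account, delta, key):
        if key not in self.applied:           # a repeated request is a no-op
            self.balances[account] += delta
            self.applied.add(key)

    def debit(self, account, amount, key):
        self._apply(account, -amount, key)

    def credit(self, account, amount, key):
        if self._credit_failures > 0:         # simulate a transient outage
            self._credit_failures -= 1
            raise TransientError("credit service timed out")
        self._apply(account, amount, key)

    def refund(self, account, amount, key):
        # Assumed to always succeed; a real system cannot guarantee this.
        self._apply(account, amount, key)


def transfer(ledger, sender, receiver, amount, max_attempts=3):
    """The caller owns retries, idempotency keys, and compensation."""
    transfer_id = uuid.uuid4().hex
    ledger.debit(sender, amount, key=f"{transfer_id}:debit")
    for _ in range(max_attempts):
        try:
            ledger.credit(receiver, amount, key=f"{transfer_id}:credit")
            return "completed"
        except TransientError:
            continue                          # retry the credit
    # Retries exhausted: undo the debit so the money is not lost.
    ledger.refund(sender, amount, key=f"{transfer_id}:refund")
    return "refunded"


ledger = Ledger({"alice": 100, "bob": 0}, credit_failures=2)
print(transfer(ledger, "alice", "bob", 30), ledger.balances)
```

Even this toy version leaves gaps: if the client process itself crashes between the debit and the credit, nothing remembers that a transfer was in flight.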

2. Persistent Queues

Persistent queues add a degree of flexibility by decoupling the client from the server. Messages are placed in a queue, and the system processes them asynchronously. Queues help distribute workloads: they support automatic retries and asynchronous processing, which can smooth out demand spikes.

Pros of Persistent Queues

Automatic retries

Load distribution

Producer-consumer separation

Cons of Persistent Queues

Loss of ordering

Dead-letter queues

Limited visibility into status

Queues work well when you need flexibility and decoupling, but they lack the control and visibility needed for comprehensive failure management.
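The trade-offs listed above can be seen in a small simulation. This is a sketch under stated assumptions, not a real broker client: `process_queue` and `make_flaky_handler` are hypothetical helpers, and an in-memory `deque` stands in for a persistent queue. Failed messages are re-enqueued at the back until they run out of attempts, at which point they land in a dead-letter list.

```python
from collections import deque


def process_queue(messages, handler, max_attempts=3):
    """Drain a queue, re-enqueueing failures until max_attempts is reached."""
    queue = deque((msg, 1) for msg in messages)    # (payload, attempt number)
    processed, dead_letters = [], []
    while queue:
        msg, attempt = queue.popleft()
        try:
            handler(msg)
            processed.append(msg)
        except Exception:
            if attempt < max_attempts:
                queue.append((msg, attempt + 1))   # retry goes to the back
            else:
                dead_letters.append(msg)           # give up on this message
    return processed, dead_letters


def make_flaky_handler():
    """Hypothetical consumer: 'b' fails once, 'poison' always fails."""
    seen = {}

    def handler(msg):
        seen[msg] = seen.get(msg, 0) + 1
        if msg == "poison":
            raise ValueError("cannot be processed")
        if msg == "b" and seen[msg] == 1:
            raise TimeoutError("transient failure")

    return handler


processed, dead_letters = process_queue(
    ["a", "b", "c", "poison"], make_flaky_handler()
)
print(processed, dead_letters)
```

In this run the retry of `b` completes after `c`, so the original ordering is lost, and `poison` ends up in the dead-letter list where someone has to find and handle it. The producer sees none of this, which is the visibility gap.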

3. Workflows

Workflows provide a robust solution for orchestrating complex processes across distributed systems. Unlike RPC or queue-based models, workflows manage retries, state, and error handling automatically, making them ideal for long-running or multi-step processes.

Pros of Workflows

Built-in resilience

Support for long-running processes

Enhanced visibility

Cons of Workflows

Infrastructure requirements

Setup complexity

For complex processes that demand reliability and transparency, workflows provide the most comprehensive solution, though they require dedicated infrastructure to deploy effectively.
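As a rough illustration of what "managing retries and state automatically" can mean, here is a toy workflow runner. It is not Temporal's API or any real engine: `Workflow`, `step`, and the dictionary-based `journal` are hypothetical, and a real system would persist the journal durably rather than keep it in memory. Each completed step is recorded, so a rerun after an interruption resumes where it left off instead of repeating work.

```python
class Workflow:
    """Toy engine: journals each completed step so a rerun can resume."""

    def __init__(self, journal=None, max_attempts=3):
        self.journal = journal if journal is not None else {}
        self.max_attempts = max_attempts

    def step(self, name, fn, *args):
        if name in self.journal:              # finished in an earlier run
            return self.journal[name]
        last_error = None
        for _ in range(self.max_attempts):    # engine-owned retries
            try:
                result = fn(*args)
                self.journal[name] = result   # record before moving on
                return result
            except Exception as exc:
                last_error = exc
        raise last_error


accounts = {"alice": 100, "bob": 0}
calls = {"debit": 0, "credit": 0}


def debit(account, amount):
    calls["debit"] += 1
    accounts[account] -= amount
    return accounts[account]


def credit(account, amount):
    calls["credit"] += 1
    if calls["credit"] <= 3:                  # simulated outage for 3 calls
        raise ConnectionError("credit service unavailable")
    accounts[account] += amount
    return accounts[account]


def transfer_workflow(wf, sender, receiver, amount):
    # The business logic reads like the happy path.
    wf.step("debit", debit, sender, amount)
    wf.step("credit", credit, receiver, amount)
    return "completed"


journal = {}                                  # would be durable storage
try:
    first_run = transfer_workflow(Workflow(journal), "alice", "bob", 30)
except ConnectionError:
    first_run = "interrupted"
second_run = transfer_workflow(Workflow(journal), "alice", "bob", 30)
print(first_run, second_run, calls["debit"], accounts)
```

The first run is interrupted after the debit; the second run skips the journaled debit and only performs the credit, so the sender is debited exactly once. The journal also doubles as a record of how far each transfer got, which is where the visibility benefit comes from.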

At Temporal, we addressed these challenges by designing a platform that handles resilience, error handling, and state management so you don’t have to.

With Temporal, you write workflows as code, with no extra XML, JSON, or YAML definitions of workflow logic that are difficult to understand and debug down the line. Define your steps in regular code, and Temporal does the rest: it manages retries and maintains state, so your workflows are reliable and simple to create.

Companies like ANZ Bank, one of the largest banks in the Asia-Pacific region, rely on Temporal to strengthen the resilience and reliability of critical financial processes. With Temporal, ANZ orchestrates and manages complex operations across distributed systems, ensuring tasks are retried automatically, failures are handled, and long-running processes are tracked seamlessly. This has enabled ANZ to boost system reliability, reduce operational complexity, and uphold strict compliance standards in their high-stakes FinServ environment.

Failure Management Is a Strategy, Not a Setback

Any complex system will encounter failures. But how you handle those failures makes all the difference. For developers, focusing on failure management from the start distinguishes exceptional teams from average ones. Building resilience into your system sets your project up for long-term success.

Discover how to make your app resilient with examples from leading companies like Snap and Coinbase, or start a free trial of Temporal Cloud today.
