Time-Travel Debugging and Durable Execution

In this post, I’ll give an overview of time-travel debugging (what it is, its history, how it’s implemented) and show how it relates to debugging your production code.

Normally, when we use debuggers, we set a breakpoint on a line of code, we run our code, execution pauses on our breakpoint, we look at values of variables and maybe the call stack, and then we manually step forward through our code's execution. In time-travel debugging, also known as reverse debugging, we can step backward as well as forward. This is powerful because debugging is an exercise in figuring out what happened: traditional debuggers are good at telling you what your program is doing right now, whereas time-travel debuggers let you see what happened. You can wind back to any line of code that executed and see the full program state at any point in your program’s history.

History and current state

It all started with Smalltalk-76, developed in 1976 at Xerox PARC. (Everything started at PARC 😄.) It had the ability to retrospectively inspect checkpointed places in execution. Around 1980, MIT added a "retrograde motion" command to its DDT debugger, which gave a limited ability to move backward through execution. In a 1995 paper, MIT researchers released ZStep 95, the first true reverse debugger, which recorded all operations as they were performed and supported stepping backward, reverting the system to the previous state. However, it was a research tool and not widely adopted outside academia.

ODB, the Omniscient Debugger, was a Java reverse debugger that was introduced in 2003, marking the first instance of time-travel debugging in a widely used programming language. GDB (perhaps the most well-known command-line debugger, used mostly with C/C++) added it in 2009.

Now, time-travel debugging is available for many languages, platforms, and IDEs.

Implementation techniques

There are three main approaches to implementing time-travel debugging:

Record & Replay

Snapshotting

Instrumentation

rr uses the first (its name stands for Record and Replay), as does Replay. WinDbg uses the first two, and Undo uses all three.
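To make the first two techniques a bit more concrete, here is a minimal toy sketch of how snapshotting and replay can combine to give a "step back" command. It assumes a fully deterministic program modeled as a list of small functions; the `TimeTravelRun` class and all names in it are made up for illustration and are not how rr, WinDbg, or Undo are actually built.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
Op = Callable[[State], None]  # one deterministic "instruction"


class TimeTravelRun:
    """Toy debugger: snapshot every `interval` steps, replay in between."""

    def __init__(self, program: List[Op], interval: int = 3) -> None:
        self.program = program
        self.interval = interval
        self.state: State = {}
        self.position = 0  # number of ops executed so far
        self.snapshots: Dict[int, State] = {0: {}}

    def step_forward(self) -> None:
        self.program[self.position](self.state)
        self.position += 1
        if self.position % self.interval == 0:
            # Snapshotting: keep a copy of the full state now and then.
            self.snapshots[self.position] = dict(self.state)

    def goto(self, target: int) -> None:
        # Restore the nearest snapshot at or before the target...
        base = max(p for p in self.snapshots if p <= target)
        self.state = dict(self.snapshots[base])
        self.position = base
        # ...then replay forward. This only works because ops are deterministic.
        while self.position < target:
            self.step_forward()

    def step_back(self) -> None:
        if self.position > 0:
            self.goto(self.position - 1)


def set_x(s: State) -> None: s["x"] = 1
def double_x(s: State) -> None: s["x"] *= 2
def add_y(s: State) -> None: s["y"] = s["x"] + 5
def inc_x(s: State) -> None: s["x"] += 1


run = TimeTravelRun([set_x, double_x, add_y, inc_x])
for _ in range(4):
    run.step_forward()
end_state = dict(run.state)   # state after all four ops

run.step_back()
one_back = dict(run.state)    # restored directly from the snapshot at step 3

run.step_back()
two_back = dict(run.state)    # rebuilt by replaying steps 1-2 from snapshot 0
```

The trade-off in this sketch is the usual one: a smaller `interval` means more memory spent on snapshots, while a larger one means more re-execution each time we step back.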

Time-traveling in production

Traditionally, running a debugger in prod doesn't make much sense. Sure, we could SSH into a prod machine and start the process handling requests with a debugger and a breakpoint, but once we hit the breakpoint, we're delaying responses to all current requests and unable to respond to new ones. Also, debugging non-trivial issues is an iterative process: we get a clue, we keep looking and find more clues, and discovering each clue typically means rerunning the program and reproducing the failure. So, instead of debugging in production, we replicate whatever issue we're investigating on our dev machine, use a debugger locally (or, more often, add log statements 😄), and re-run as many times as required to figure it out. Replicating takes time (in some cases a lot of time, and in some cases infinite time), so it would be really useful if we didn't have to.

While running traditional debuggers in prod doesn't make sense, time-travel debuggers can record a process execution on one machine and replay it on another. So we can record (or snapshot or instrument) production and replay it on our dev machine for debugging (depending on the tool, our machine may need to have the same CPU instruction set as prod). However, the recording step generally doesn't make sense in prod given its high overhead: if we set up recording and then have to use ten times as many servers to handle the same load, whoever pays our AWS bill will not be happy 😁.

But there are a couple of scenarios in which it does make sense:

2–5x

turn it on temporarily

The rest of this post is about #2, which is a way of running programs called durable execution.

Durable execution

What's that?

First, a brief backstory. After Amazon (one of the first large adopters of microservices) decided that using message queues to communicate between services was not the way to go (hear the story first-hand here), they started using orchestration. And once they realized defining orchestration logic in YAML/JSON wasn't a good developer experience, they created AWS Simple Workflow Service to define logic in code. This technique of backing code by an orchestration engine is called durable execution, and it spread to Azure Durable Functions, Cadence (used at Uber for more than 1,000 services), and Temporal (used by Stripe, Netflix, Datadog, Snap, Coinbase, and many more).

Durable execution runs code durably—recording each step in a database, so that when anything fails, it can be retried from the same step. The machine running the function can even lose power before it gets to line 10, and another process is guaranteed to pick up executing at line 10, with all variables and threads intact.¹ It does this with a form of record & replay: all input from the outside is recorded, so when the second process picks up the partially-executed function, it can replay the code (in a side-effect–free manner) with the recorded input in order to get the code into the right state by line 10.
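Here is a rough sketch of that recover-by-replay idea, assuming a plain Python list stands in for the database. `DurableContext`, `Crash`, and the order workflow are hypothetical names invented for this example; they are not Temporal's API.

```python
from typing import Any, Callable, List


class Crash(Exception):
    """Stands in for the machine losing power mid-function."""


class DurableContext:
    """Hypothetical helper: replays recorded results, records new ones."""

    def __init__(self, history: List[Any]) -> None:
        self.history = history  # stands in for the database
        self.cursor = 0

    def step(self, fn: Callable[[], Any]) -> Any:
        if self.cursor < len(self.history):
            result = self.history[self.cursor]  # replay: fn is not called again
        else:
            result = fn()                       # first run: touch the outside world
            self.history.append(result)         # persist before moving on
        self.cursor += 1
        return result


calls: List[str] = []


def charge_card() -> str:
    calls.append("charge")
    return "receipt-1"


def send_email() -> str:
    calls.append("email")
    if calls.count("email") == 1:
        raise Crash("power lost")  # fails only on the first attempt
    return "sent"


def order_workflow(ctx: DurableContext) -> str:
    receipt = ctx.step(charge_card)
    status = ctx.step(send_email)
    return f"{receipt}:{status}"


history: List[Any] = []
try:
    order_workflow(DurableContext(history))        # first process dies at step 2
except Crash:
    pass

result = order_workflow(DurableContext(history))   # second process replays step 1
```

In this toy version the second run gets `"receipt-1"` from the recorded history, so the card is charged only once even though the function body ran twice.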

Durable execution's flavor of record & replay doesn't use high-overhead methods like software JIT binary translation, snapshotting, or instrumentation. It also doesn't require special hardware. It does impose one constraint: durable code must be deterministic (i.e., given the same input, it must take the same code path). So it can't do things that might have different results at different times, like use the network or disk. However, it can call other functions that are run normally ("volatile functions", as we like to call them 😄), and while each step of those functions isn't persisted, the functions are automatically retried on transient failures (like a service being down).
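The automatic retrying of volatile functions could look something like the sketch below. The `TransientError` class, the `run_with_retries` helper, and its default parameters are all assumptions made for illustration; a real engine has its own retry policy settings.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. service down)."""


def run_with_retries(fn: Callable[[], T], max_attempts: int = 5,
                     initial_delay: float = 1.0, backoff: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on TransientError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise          # out of attempts: surface the failure
            sleep(delay)
            delay *= backoff
    raise AssertionError("unreachable")


attempts: List[int] = []


def flaky_service() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("service down")
    return "ok"


delays: List[float] = []
# Pass a fake sleep so the example records the delays instead of waiting.
result = run_with_retries(flaky_service, sleep=delays.append)
```

Only the final result of a call like this would need to be recorded by the durable function; the individual attempts are not replayed.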

Only the steps that require interacting with the outside world (like calling a volatile function, or calling sleep('30 days'), which stores a timer in the database) are persisted. Their results are also persisted, so that when you replay the durable function that died on line 10, if it previously called the volatile function on line 5 that returned "foo", during replay, "foo" will immediately be returned (instead of the volatile function getting called again). Saving things to the database does add latency, but Temporal supports extremely high throughput (tested up to a million recorded steps per second). And in addition to function recoverability and automatic retries, it comes with many more benefits, including extraordinary visibility into and debuggability of production.
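One way to picture a persisted timer is as a stored deadline rather than an in-memory wait. The sketch below assumes a dict stands in for the database, and `remaining_sleep` is a hypothetical helper; it is not meant to describe how Temporal implements timers internally.

```python
from typing import Dict

DAY = 86_400.0  # seconds


def remaining_sleep(timers: Dict[str, float], timer_id: str,
                    duration_s: float, now: float) -> float:
    """Return how many seconds are still left on a persisted timer.

    `timers` stands in for a database table mapping timer id -> deadline.
    """
    if timer_id not in timers:
        timers[timer_id] = now + duration_s  # first execution: persist the deadline
    return max(0.0, timers[timer_id] - now)


timers: Dict[str, float] = {}

# First execution at t=0: a 30-day timer is stored.
first = remaining_sleep(timers, "trial-ends", 30 * DAY, now=0.0)

# The process dies and is replayed on day 12: only 18 days remain.
after_crash = remaining_sleep(timers, "trial-ends", 30 * DAY, now=12 * DAY)

# Replayed after the deadline has passed: no waiting at all.
after_deadline = remaining_sleep(timers, "trial-ends", 30 * DAY, now=45 * DAY)
```

Because the deadline lives in storage instead of in the process, a crash in the middle of the 30 days doesn't restart the clock.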

Debugging prod

With durable execution, we can read through the steps that every single durable function took in production. We can also download the execution’s history, check out the version of the code that's running in prod, and pass the history file to a replayer (Temporal has runtimes for Go, Java, JavaScript, Python, .NET, and PHP) so we can see in a debugger exactly what the code did during that production function execution. Read this post or watch this video to see an example in VS Code.²
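As a rough illustration of what a replayer does with a downloaded history, here is a toy version. The JSON shape, `ReplayContext`, `replay`, and the billing workflow are all invented for this sketch and do not match Temporal's actual history format or SDK; the point is only that recorded results are fed back in, and that code which no longer matches the history is detected.

```python
import json
from typing import Any, Callable, Dict, List


class NonDeterminismError(Exception):
    """Raised when the code asks for a step the history did not record."""


class ReplayContext:
    """Feeds recorded results back to the code; never calls the outside world."""

    def __init__(self, events: List[Dict[str, Any]]) -> None:
        self.events = events
        self.cursor = 0

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.cursor >= len(self.events):
            raise NonDeterminismError(f"code ran unrecorded step {name!r}")
        event = self.events[self.cursor]
        if event["step"] != name:
            raise NonDeterminismError(
                f"history has {event['step']!r}, code asked for {name!r}")
        self.cursor += 1
        return event["result"]  # fn is deliberately not called during replay


def replay(workflow: Callable[[ReplayContext], Any], history_json: str) -> Any:
    return workflow(ReplayContext(json.loads(history_json)))


# Pretend this string was downloaded from production.
history_json = json.dumps([
    {"step": "fetch_user", "result": {"plan": "pro"}},
    {"step": "charge", "result": "receipt-7"},
])


def billing_workflow(ctx: ReplayContext) -> str:
    user = ctx.step("fetch_user", lambda: 1 / 0)   # would blow up if really called
    if user["plan"] == "pro":                      # a good place for a breakpoint
        return ctx.step("charge", lambda: 1 / 0)
    return "free"


replayed = replay(billing_workflow, history_json)


def changed_workflow(ctx: ReplayContext) -> str:
    # A different code version that skips fetch_user no longer matches history.
    return ctx.step("charge", lambda: 1 / 0)


try:
    replay(changed_workflow, history_json)
    mismatch_detected = False
except NonDeterminismError:
    mismatch_detected = True
```

The mismatch check is why checking out the same code version that ran in prod matters: replay is only meaningful if the code takes the same path the history recorded.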

Being able to debug any past production code is a huge step up from the other option (finding a bug, trying to repro locally, failing, turning on Undo recording in prod until it happens again, turning it off, then debugging locally). It's also a (sometimes necessary) step up from distributed tracing.


I hope you found this post interesting! If you'd like to learn more about durable execution, I recommend reading:

and watching:

Thanks to Greg Law, Jason Laster, Chad Retz, and Fitz for reviewing drafts of this post.
