Temporal Blog, September 30, 19:15
Time-Travel Debugging and Durable Execution

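Time-travel debugging is a powerful technique that lets developers step back through a program's execution history and inspect its state at any point in time. This post covers how time-travel debugging works, its history, how it is implemented, and how it applies to debugging in production. Traditional debuggers can only step forward, whereas time-travel debuggers can also step backward, which helps developers understand what a program actually did. Early tools such as Smalltalk-76, MIT's DDT, and ZStep 95 laid the groundwork, and modern tools such as Replay, WinDbg, rr, and Undo offer more complete implementations. Implementation approaches include record and replay, snapshotting, and instrumentation. In production, a time-travel debugger can record a process's execution and replay it on a development machine, although the recording overhead has to be taken into account. Durable execution is a programming model that records each step so that, after a failure, execution can resume from the same step; it is commonly used in microservice architectures. With durable execution, developers can analyze production code execution in depth and debug more efficiently.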

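💡 Time-travel debugging lets developers step back through a program's execution history and inspect its state at any point in time. It is more flexible than a traditional debugger and makes it easier to understand what the program did.

🕰️ The history of time-travel debugging goes back to Smalltalk-76 (1976), MIT's DDT (around 1980), and ZStep 95 (1995). Modern tools such as Replay, WinDbg, rr, and Undo offer more complete implementations and support many languages and platforms.

📊 Implementation approaches include record and replay (e.g., rr and Replay), snapshotting (e.g., WinDbg), and instrumentation (e.g., Undo). Each has trade-offs, so the right tool and technique depends on the scenario.

🔧 In production, a time-travel debugger can record a process's execution and replay it on a development machine, but the overhead is high. It fits specific scenarios, such as turning on Undo temporarily or programs whose execution is already being recorded.

🔄 Durable execution is a programming model that records each step so that, after a failure, execution can resume from the same step. It is commonly used in microservice architectures, and combined with time-travel debugging it allows in-depth analysis of production code execution.

🌐 Durable execution does not depend on special hardware or high-overhead techniques. It only requires that the code be deterministic (the same input takes the same code path). Steps that interact with the outside world, and their results, are recorded in a database, which enables function recovery and automatic retries at high throughput.

🛠️ Modern durable execution frameworks such as Temporal support multiple languages (Go, Java, JavaScript, Python, .NET, PHP) and let developers download an execution history and debug it locally, which makes production issues much easier to investigate.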

In this post, I’ll give an overview of time-travel debugging (what it is, its history, how it’s implemented) and show how it relates to debugging your production code.

Normally, when we use debuggers, we set a breakpoint on a line of code, we run our code, execution pauses on our breakpoint, we look at values of variables and maybe the call stack, and then we manually step forward through our code's execution. In time-travel debugging, also known as reverse debugging, we can step backward as well as forward. This is powerful because debugging is an exercise in figuring out what happened: traditional debuggers are good at telling you what your program is doing right now, whereas time-travel debuggers let you see what happened. You can wind back to any line of code that executed and see the full program state at any point in your program’s history.
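To make the idea concrete, here is a toy sketch in Python (not how any of the tools mentioned in this post work internally). It uses the standard `sys.settrace` hook to save a copy of the local variables before each line of one function runs, so we can walk the states backward afterward. The names `record_execution` and `running_total` are made up for this illustration.

```python
import sys


def record_execution(func, *args):
    """Run func(*args), saving (line number, copy of locals) before each line runs."""
    history = []

    def tracer(frame, event, arg):
        if frame.f_code is not func.__code__:
            return None  # ignore every frame except the function under study
        if event == "line":
            # Shallow copy: fine for numbers and strings, not for mutated objects.
            history.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active before
    return result, history


def running_total(values):
    total = 0
    for value in values:
        total += value
    return total


result, history = record_execution(running_total, [5, -2, 10])

# "Step backward": walk the recorded states from the last line to the first.
for line_number, local_vars in reversed(history):
    print(line_number, local_vars)
```

Copying every local on every line is far too slow and memory-hungry for real programs, which is why real time-travel debuggers use the techniques described later in this post.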

History and current state

It all started with Smalltalk-76, developed in 1976 at Xerox PARC. (Everything started at PARC 😄.) It had the ability to retrospectively inspect checkpointed places in execution. Around 1980, MIT added a "retrograde motion" command to its DDT debugger, which gave a limited ability to move backward through execution. In a 1995 paper, MIT researchers released ZStep 95, the first true reverse debugger, which recorded all operations as they were performed and supported stepping backward, reverting the system to the previous state. However, it was a research tool and not widely adopted outside academia.

ODB, the Omniscient Debugger, was a Java reverse debugger introduced in 2003, marking the first instance of time-travel debugging in a widely used programming language. GDB (perhaps the most well-known command-line debugger, used mostly with C/C++) added reverse debugging in 2009.

Now, time-travel debugging is available for many languages, platforms, and IDEs, including:

    - Replay for JavaScript in Chrome, Firefox, and Node, and Wallaby for tests in Node
    - WinDbg for Windows applications
    - rr for C, C++, Rust, Go, and others on Linux
    - Undo for C, C++, Java, Kotlin, Rust, and Go on Linux
    - Various extensions (often rr- or Undo-based) for Visual Studio, VS Code, JetBrains IDEs, Emacs, etc.

Implementation techniques

There are three main approaches to implementing time-travel debugging:

    1. Record & Replay: Record all non-deterministic inputs to a program during its execution. Then, during the debug phase, the program can be deterministically replayed using the recorded inputs in order to reconstruct any prior state.
    2. Snapshotting: Periodically take snapshots of a program's entire state. During debugging, the program can be rolled back to these saved states. This method can be memory-intensive because it involves storing the entire state of the program at multiple points in time.
    3. Instrumentation: Add extra code to the program that logs changes in its state. This extra code allows the debugger to step the program backwards by reverting changes. However, this approach can significantly slow down the program's execution.

rr uses the first (the rr name stands for Record and Replay), as does Replay. WinDbg uses the first two, and Undo uses all three (see how it differs from rr).
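As a rough illustration of record & replay, the sketch below logs a program's only non-deterministic input (a die roll) during a live run, then feeds the log back during a second run to reproduce the same states. It is a toy model under the assumption that all non-determinism flows through one `read` call; `Recorder`, `Replayer`, and `program` are hypothetical names, and real tools such as rr work at the level of system calls and CPU behavior rather than Python objects.

```python
import random


class Recorder:
    """Live run: ask the real source for each input and log the value."""

    def __init__(self):
        self.log = []

    def read(self, source):
        value = source()
        self.log.append(value)
        return value


class Replayer:
    """Debug run: ignore the real source and return the logged values in order."""

    def __init__(self, log):
        self._values = iter(log)

    def read(self, source):
        return next(self._values)


def roll_die():
    return random.randint(1, 6)  # the non-deterministic input


def program(io, steps):
    """Deterministic apart from io.read(); returns the running total after each step."""
    total, states = 0, []
    for _ in range(steps):
        total += io.read(roll_die)
        states.append(total)
    return states


recorder = Recorder()
recorded_states = program(recorder, 5)

# Replaying the log reproduces the live run exactly.
replayed_states = program(Replayer(recorder.log), 5)

# "Going back" to step 3 means replaying from the start and stopping early.
state_after_step_3 = program(Replayer(recorder.log), 3)[-1]
```

Note that reaching an earlier state means re-running from the beginning, which is one reason a tool might combine this approach with snapshots.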

Time-traveling in production

Traditionally, running a debugger in prod doesn't make much sense. Sure, we could SSH into a prod machine and start the process handling requests with a debugger and a breakpoint, but once we hit the breakpoint, we're delaying responses to all current requests and unable to respond to new requests. Also, debugging non-trivial issues is an iterative process: we get a clue, we keep looking and find more clues, and discovering each clue typically means rerunning the program and reproducing the failure. So, instead of debugging in production, we replicate whatever issue we're investigating on our dev machine, use a debugger locally (or, more often, add log statements 😄), and re-run as many times as required to figure it out. Replicating takes time (in some cases a lot of time, and in some cases infinite time), so it would be really useful if we didn't have to.

While running traditional debuggers in prod doesn't make sense, time-travel debuggers can record a process execution on one machine and replay it on another. So we can record (or snapshot or instrument) production and replay it on our dev machine for debugging (depending on the tool, our machine may need to have the same CPU instruction set as prod). However, the recording step generally doesn't make sense in prod either, given its high overhead: if we set up recording and then have to use ten times as many servers to handle the same load, whoever pays our AWS bill will not be happy 😁.

But there are a couple of scenarios in which it does make sense:

    1. Undo only slows down execution 2–5x, so while we don't want to leave it on just in case, we can turn it on temporarily on a subset of prod processes for hard-to-repro bugs until we have captured the bug happening, and then turn it off.
    2. When we're already recording the execution of a program in the normal course of operation.

The rest of this post is about #2, which is a way of running programs called durable execution.

Durable execution

What's that?

First, a brief backstory. After Amazon (one of the first large adopters of microservices) decided that using message queues to communicate between services was not the way to go (hear the story first-hand here), they started using orchestration. And once they realized defining orchestration logic in YAML/JSON wasn't a good developer experience, they created AWS Simple Workflow Service to define logic in code. This technique of backing code by an orchestration engine is called durable execution, and it spread to Azure Durable Functions, Cadence (used at Uber for more than 1,000 services), and Temporal (used by Stripe, Netflix, Datadog, Snap, Coinbase, and many more).

Durable execution runs code durably, recording each step in a database so that when anything fails, it can be retried from the same step. The machine running the function can even lose power before it gets to line 10, and another process is guaranteed to pick up executing at line 10, with all variables and threads intact. It does this with a form of record & replay: all input from the outside is recorded, so when the second process picks up the partially-executed function, it can replay the code (in a side-effect–free manner) with the recorded input in order to get the code into the right state by line 10.

Durable execution's flavor of record & replay doesn't use high-overhead methods like software JIT binary translation, snapshotting, or instrumentation. It also doesn't require special hardware. It does require one constraint: durable code must be deterministic (i.e., given the same input, it must take the same code path). So it can't do things that might have different results at different times, like use the network or disk. However, it can call other functions that are run normally ("volatile functions", as we like to call them 😄), and while each step of those functions isn't persisted, the functions are automatically retried on transient failures (like a service being down).
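To see why the determinism constraint matters, consider this small sketch of a replay that goes wrong. It is a toy model, not Temporal's implementation: `replay`, `NonDeterminismError`, and `make_order_workflow` are names invented here, and the history is a plain list of `(step name, result)` pairs. The workflow branches on the current hour, which is read directly instead of being recorded, so a replay at a different time of day asks for a step the history doesn't contain.

```python
class NonDeterminismError(Exception):
    """Raised when replayed code asks for a step the history did not record."""


def replay(workflow, history):
    """Re-run workflow, answering each step from history, a list of (name, result)."""
    position = 0

    def step(name):
        nonlocal position
        if position >= len(history) or history[position][0] != name:
            raise NonDeterminismError(
                f"step {position}: code asked for {name!r}, "
                "which does not match the recorded history"
            )
        result = history[position][1]
        position += 1
        return result

    return workflow(step)


def make_order_workflow(read_hour):
    """read_hour stands in for reading the wall clock directly."""

    def order_workflow(step):
        step("load_order")
        if read_hour() < 22:  # BUG: the code path depends on when the code runs
            return step("charge_card")
        return step("queue_for_morning")

    return order_workflow


# History recorded by an original run that happened at 9:00.
history = [("load_order", {"id": 7}), ("charge_card", "charged")]

# Replaying at a similar hour follows the recorded path.
same_path = replay(make_order_workflow(lambda: 9), history)

# Replaying at 23:00 takes the other branch and no longer matches the history.
try:
    replay(make_order_workflow(lambda: 23), history)
    diverged = False
except NonDeterminismError:
    diverged = True
```

One way out, in this toy model, is to treat the clock as outside input: read it through `step` so that its value is recorded and returned unchanged during replay.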

Only the steps that require interacting with the outside world (like calling a volatile function, or calling sleep('30 days'), which stores a timer in the database) are persisted. Their results are also persisted, so that when you replay the durable function that died on line 10, if it previously called the volatile function on line 5 that returned "foo", during replay, "foo" will immediately be returned (instead of the volatile function getting called again). Saving things to the database does add latency, but Temporal supports extremely high throughput (tested up to a million recorded steps per second). And in addition to function recoverability and automatic retries, it comes with many more benefits, including extraordinary visibility into and debuggability of production.
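The behavior described above can be sketched in a few lines of Python. This is a simplified model under loud assumptions: a plain list stands in for the database, a `Crash` exception stands in for the machine losing power, and `DurableContext`, `fetch_value`, and `send_value` are hypothetical names, not part of any real SDK.

```python
class Crash(Exception):
    """Stands in for the process losing power."""


class DurableContext:
    def __init__(self, database, crash_after=None):
        self.database = database        # recorded results; survives crashes
        self.position = 0               # index of the next step in this run
        self.crash_after = crash_after  # simulate a crash once N steps are stored

    def call(self, volatile_function, *args):
        if self.position < len(self.database):
            # Replay: return the recorded result without calling the function.
            result = self.database[self.position]
        else:
            # First time: really call it, then persist the result.
            result = volatile_function(*args)
            self.database.append(result)
            if len(self.database) == self.crash_after:
                raise Crash("power lost")
        self.position += 1
        return result


calls = []  # tracks which volatile functions were really executed


def fetch_value():
    calls.append("fetch_value")
    return "foo"


def send_value(value):
    calls.append("send_value")
    return f"sent {value}"


def durable_function(ctx):
    value = ctx.call(fetch_value)
    return ctx.call(send_value, value)


database = []

# First process: dies right after the result of fetch_value is persisted.
try:
    durable_function(DurableContext(database, crash_after=1))
except Crash:
    pass

# Second process: replays "foo" from the database, then continues with send_value.
result = durable_function(DurableContext(database))
```

After both runs, `calls` shows that `fetch_value` really ran only once even though `durable_function` started twice, because the second run got "foo" from the recorded history.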

Debugging prod

With durable execution, we can read through the steps that every single durable function took in production. We can also download the execution's history, check out the version of the code that's running in prod, and pass the history file to a replayer (Temporal has runtimes for Go, Java, JavaScript, Python, .NET, and PHP) so we can see in a debugger exactly what the code did during that production function execution. Read this post or watch this video to see an example in VS Code.

Being able to debug any past production code is a huge step up from the other option (finding a bug, trying to repro locally, failing, turning on Undo recording in prod until it happens again, turning it off, then debugging locally). It's also a (sometimes necessary) step up from distributed tracing.



I hope you found this post interesting! If you'd like to learn more about durable execution, I recommend reading:

and watching:

Thanks to Greg Law, Jason Laster, Chad Retz, and Fitz for reviewing drafts of this post.
