Debugging for Mid Coders

The author shares the challenges they encountered while learning to debug code, and how they improved by studying rationality techniques and picking up a set of practical skills. The piece emphasizes patience, systematically forming and falsifying hypotheses, and extending working memory. It also digs into handling "downstream symptom" errors, recognizing abstraction boundaries, using binary search to locate problems, and inspecting the data rather than just the code. The author is candid about remaining confusions around inconsistently-reproducing bugs, reading documentation, and understanding the whole stack, and hopes for better ways of conveying tacit knowledge.

🎯 **Patience and rational thinking are the foundation of debugging**: Debugging demands real patience, and sits at a difficulty between "toy puzzle" and "open-ended real-world problem." The author stresses that rationality training, such as systematically forming and testing hypotheses and extending working memory (taking notes, using a bigger monitor), can markedly improve debugging ability and avoid aimless flailing.

⚠️ **Distinguish "red herrings" from "downstream symptoms"**: Many error messages (such as "X does not exist") are not the root cause but downstream symptoms. When hitting such errors, look first for the place where X failed to be created or retrieved, rather than the spot where the error is raised, especially when the error surfaces inside a dependency.

🧱 **Recognize abstraction boundaries and scope the problem**: Understanding the code's abstraction boundaries is crucial. When debugging stalls, judge whether you have wandered beyond the modules directly relevant to the problem (such as the code for one specific feature), to avoid wasting effort on unrelated underlying infrastructure and to narrow the search more effectively.

🔍 **Master binary search and version bisection**: Binary search is an effective strategy for locating elusive bugs, whether by stepping forwards or backwards through a code path, or by using Git's `bisect` command to find the commit that introduced the problem. The author also notes the added complexity of binary searching through undo history.

📊 **Look at the data itself and check assumptions**: Debugging shouldn't be confined to the code; sometimes the problem is in the data the code operates on. The author illustrates this from personal experience with the importance of inspecting input data (such as a corrupted Markdown file). When confused, step back and list every possibly relevant assumption, checking for anything not yet fully understood or examined.

Published on August 16, 2025 10:32 PM GMT

I struggled with learning to debug code for a long time. Exercises for learning debugging tended to focus on small, toy examples that didn't grapple with the complexity of real codebases. I would read advice on the internet like:

I'd often be starting from a situation where the bug only happened sometimes and I had no idea when, and the error log was about some unrelated library I was using that had nothing to do with what the bug would turn out to be. Getting to a point where it was even clear what exactly the symptoms were was a giant opaque question mark for me.

I didn't improve much at debugging until I got generally serious about rationality training. Debugging is a nice in-between difficulty between "toy puzzle" and "solve a complex, open-ended real world problem." (Code debugging is "real world problem-solving", but it's a part of the world where you reliably know there's a solution, working in an environment you have near-complete control over and visibility into.)

I attribute a lot of my improvement to fairly general problem-solving tools like:

I've written other essays on general rationality/problem-solving. But, here I wanted to write the essay I wish past me had gotten, about some tacit knowledge of debugging. (Partly because it seemed useful, and partly because I'm interested in the general art of "articulating tacit knowledge").

Note that I'm still a kinda mid-level programmer and senior people may have better advice than me, or think some of this is wrong. This is partly me just writing to help myself understand, and see how well I can currently explain things. 

Be willing to patiently walk up the stack trace (but, watch out for red-herrings and gotchas)

One core skill of debugging is the ability to patiently, thoroughly start from your first clue, and work your way through the codebase, developing a model of what exactly is happening. (Instead of reaching the edge of the current file and giving up)

Unfortunately, another core skill of debugging is knowing when you're about to follow the code in a pointless direction that won't really help you.

Gotcha #1: "X does not exist", "can't read X", "X is undefined", etc., are often "downstream symptoms", rather than the bug itself.

If you get an error like "X doesn't exist", there's a codepath that expects X to exist, but it doesn't. Whatever code caused it to not-exist isn't particularly likely to be located anywhere near the code that's trying to read X.

So, unless you look at the part of the codebase flagging "can't find X" and can clearly see that X was supposed to be created in the same file, or you have a good reason to think it was created nearby, what you should probably be looking for instead are the places that are supposed to create or retrieve X.

This goes double if the error is in some random library somewhere deep in your dependencies.
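To make this concrete, here's a minimal, hypothetical TypeScript sketch (every name in it is invented for illustration). The error fires in `render`, but the actual bug lives in `buildIndex`, which silently skipped creating the entry much earlier:

```typescript
// Hypothetical sketch of a "downstream symptom": all names are invented.
type Index = Map<string, string>;

function buildIndex(entries: { id: string; body?: string }[]): Index {
  const index: Index = new Map();
  for (const entry of entries) {
    // The real bug: entries with a missing body get silently skipped,
    // so "X" never gets created here...
    if (entry.body) {
      index.set(entry.id, entry.body);
    }
  }
  return index;
}

function render(index: Index, id: string): string {
  const body = index.get(id);
  if (body === undefined) {
    // ...but the error you actually see points here, far from the real bug.
    throw new Error(`can't find ${id}`);
  }
  return `<p>${body}</p>`;
}

const index = buildIndex([{ id: "X" }]); // body missing, so "X" gets dropped
render(index, "X");                      // throws: "can't find X"
```

The stack trace blames `render`, but the fix belongs in `buildIndex`, which in a real codebase might be a different file, module, or even service.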

(I originally called this pattern a "red-herring." A senior-dev colleague told me they draw a sharp distinction between "red-herrings" and "downstream symptoms.")

Gotcha #2: Notice abstraction boundaries, and irrelevant abstractions.

The motivating incident that prompted this post was when I was pairing with a junior dev on debugging the JargonBot Beta Test. We had recently built an API endpoint[1] for retrieving the jargon definitions for a post. It had been working. Suddenly it was not.

We started with where an error was getting thrown, worked our way backwards in the code path... and then followed the trail to another file... and then more backwards... 

...and then at some point both I and the junior dev said "it... probably isn't useful to look further backwards." (The junior dev said this hesitantly, and I agreed.)

We were right. How did we know that?

The answer (I think) is that if we stepped backwards any further, we were leaving the part of the codebase that was particularly about Jargon. Beyond that lay the underlying infrastructure for the LessWrong codebase, i.e. the code responsible for making API endpoints work at all. Last we checked, the underlying infrastructure of LessWrong worked fine, and we hadn't done anything to mess with that. 

So, even though we were still thoroughly confused about what the problem was, it seemed at least like it should be contained to within these few files.

There was a chunk of the codebase that you'd naturally describe as "about the JargonBot stuff." And there was a point you might naturally describe as its edge, before passing into another part of the codebase.

Now, that was sort of playing on easy mode – we knew we had just started building JargonBot, and it would be pretty crazy for the bug to not live somewhere in the files we had just written. But, a few weeks later we were debugging some other part of the codebase that other people had written, a while ago. (I believe this was integrating the jargon into the post sidebar, where side-comments, side-notes, and reacts live.)

It turned out the sidebar was more complicated than I'd have guessed. The jargon wasn't rendering correctly. We had to pass backwards through a few different abstraction layers – first checking the new code where we were trying to fetch and render the jargon. Then back into the code for the sidebar itself. Then backwards into where the sidebar was integrated into the post page. And then I was confused about how the sidebar was even integrated into the post page; it wasn't nearly as straightforward as I thought.

Then I sort of roughly got how the sidebar/post interaction worked, but was still confused about my bug. I could recurse further up the code path...

...but it would be pretty unlikely for whatever was going wrong to be happening outside of the post page itself.

This all amounts to: 

Binary Search

There is some intricate art to looking at a confusing output and generating hypotheses that might possibly make sense. A good developer has object-level knowledge, both about their own codebase and about codebases in general, that can help them narrow in quickly on where the problem is located.

I don't know how to articulate the details of that skill. BUT, the good news is there is a dumb, stupid algorithm you can follow to help you narrow it down even if you're tired, frustrated, and braindead:

1. Follow the code backwards towards the earliest codepath that could possibly be relevant, and the earliest point where it is working as expected.

2. Find the spot about halfway between the earliest place where things work as expected, and the final place where you observe the bug. Log all relevant variables (or add a breakpoint and use a debugger).

3. If things look normal there, then find a new spot about halfway between that midpoint, and the final-spot-where-the-bug-was-observed. If it doesn't look normal, find a new spot about halfway between the earliest working spot, and the midpoint.

4. Repeat steps 2-3 until you've found the moment where things break.

Now you have a much smaller surface area you need to look at and understand. And instead of stepping through 100 lines of code, you only had to check ~7 midpoints (log₂ of 100 ≈ 6.6).
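If it helps to see the loop written down, here's a minimal, hypothetical sketch. In real debugging you run this by hand: `looksCorrect(i)` stands in for you adding a log statement or breakpoint at checkpoint `i` and eyeballing the output.

```typescript
// Minimal sketch of the binary-search loop, assuming checkpoint 0 is
// known-good and the last checkpoint is known-bad.
function findFirstBrokenCheckpoint(
  numCheckpoints: number,
  looksCorrect: (i: number) => boolean,
): number {
  let good = 0;                  // earliest point known to work
  let bad = numCheckpoints - 1;  // latest point known to be broken
  while (bad - good > 1) {
    const mid = Math.floor((good + bad) / 2);
    if (looksCorrect(mid)) {
      good = mid;                // still fine here: the bug is later
    } else {
      bad = mid;                 // already broken: the bug is at or before mid
    }
  }
  return bad;                    // the first checkpoint where things break
}
```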

Binary searching through ~~time~~ spacetime

A variation on this: if you know the code used to work and now it doesn't, you can binary search through your history. If it worked a long time ago, on an older commit, you can binary search through your commit history. Start with the oldest commit where you know it worked, check a commit about halfway between that and the latest commit, and repeat. Git has a tool to streamline this process called git bisect.
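The workflow looks roughly like this (standard git bisect usage; the commit hash is a placeholder):

```bash
git bisect start
git bisect bad                # the commit you're on now is broken
git bisect good abc1234       # a commit you know was working (placeholder hash)
# git checks out a commit roughly halfway between the two;
# test it, then tell git what you found:
git bisect good               # or: git bisect bad
# ...repeat until git announces the first bad commit, then:
git bisect reset              # return to where you started
```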

If it worked earlier today (and you didn't make any commits in the meantime), you may need to do something more like "binary search through your undo history." This is complicated by:

Sometimes this neatly divides into "binary search through undo-space" followed by "binary search within one history-state of the codebase". But sometimes you need to kind of maintain a multidimensional mental map that includes both changes-to-the-codebase and places-within-the-codebase, and figure out what it even means to binary search that in a somewhat ad-hoc way I'm not sure how to articulate.

Notice confusion, and check assumptions.

Also, look at your data, not just your code.

Just yesterday, I was trying to build a text editor that took in markdown files and rendered them as html that I could edit in a WYSIWYG fashion. At some point, even though it was supposedly translating the markdown into html, it was still showing up with markdown formatting.

This happened after I had an LLM make some completely unrelated changes that really shouldn't have had anything to do with that.

I was pulling my hair out trying to figure out what was going wrong, stepping back and forth through undo history, looking at anything that changed.

Eventually, I stopped looking at the codepath, and looked at the file I was trying to load into the editor. 

At some point, I'd accidentally overwritten the file with corrupt markdown that was sort of... markdown nested inside html nested inside markdown. Or something.

Oh.

If I'd been a better rationalist, earlier in the hair-pulling-out process, I could have noticed "this is confusing and doesn't make any goddamn sense." A rationalist should be more confused by fiction than reality; if I'm confused, one of my assumptions is fiction. I could have made a list of everything that could possibly be relevant and checked whether there was anything I hadn't looked at yet, rather than looping back and forth in the undo history, vaguely flailing and hoping to notice something new.

But, it's kind of cognitively expensive to be a rationalist all the time. 

A simpler thing I could have done is be a better debugger, with some object-level debugging knowledge, and note that sometimes it's not the code that's wrong, it's the data the code is trying to operate on. "Check the data" probably should have been part of my initial pass at mapping out the problem.

(This ties back to Gotcha #1: if you're getting "X is not defined", where X was created a while ago and not by the obvious nearby parts of the codebase, sometimes try looking up X in your database and see if there's anything weird about it.)
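One cheap way to make "check the data" part of the initial pass is a sanity check at the boundary where the data enters the code. A minimal, hypothetical sketch (the `looksLikeCleanMarkdown` heuristic is invented for illustration, loosely matching my corrupted-file incident):

```typescript
import { readFileSync } from "node:fs";

// Crude smoke test: the corrupted file in my case was markdown tangled up
// with html. (Markdown is allowed to embed html, so this is a heuristic for
// "something looks off", not a real validator.)
function looksLikeCleanMarkdown(text: string): boolean {
  const htmlTags = (text.match(/<\/?[a-z][^>]*>/gi) ?? []).length;
  return htmlTags === 0;
}

function loadMarkdown(path: string): string {
  const text = readFileSync(path, "utf8");
  if (!looksLikeCleanMarkdown(text)) {
    // Fail loudly at the load boundary, instead of debugging the renderer later.
    throw new Error(`${path} doesn't look like clean markdown`);
  }
  return text;
}
```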

Things I don't know

So, that was a bunch of stuff I've painstakingly figured out. Some of it I got in pieces by pairing with other, more senior developers. The senior developers I've worked with are often thinking so quickly/intuitively that it's fairly hard for them to slow down and explain what's going on.

I'm hoping to live in a world where people around me get better at tacit knowledge explication. 

Here's some random stuff I still don't have a good handle on, that I'd like it if somebody explained:

How do you learn to replicate bugs when they happen inconsistently, with no discernible pattern? Especially when the bug comes up, like, once every couple of days or weeks, instead of once every 5 minutes.

How does one "read the docs"? Sometimes I ask how a senior dev figured something out, and they say "I read the documentation and it explained it." And I'm like "okay, duh. But... there's so much fucking documentation. I can't possibly be expected to read it all?"

Do people just read really fast? I think they have some heuristics for figuring out what parts to read and how to skim, which maybe involves something like binary search and tracking-abstraction-borders. But something about this still feels opaque to me.

Something something "how to build up a model of the entire stack." Sometimes, the source of a problem doesn't live in the codebase; it lives in the dev-ops of how the codebase is deployed. It could have to do with the database setup, or the deployment server, or our caching layer. I recently got a little better at sniffing these out, but this still feels like a muddy mire to me.

  1. ^

    This isn't exactly the right description but is accurate enough to be an example.



