Blog about software - ordep.dev, October 2, 20:53
On-call engineers: why, how, and what to expect

 

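This article explores why software development team members should take part in production on-call, what to expect from it, and how to get better at it. The author argues that for teams that write and deploy production code and have paying customers, being on-call is an important source of responsibility and learning. It deepens your understanding of the system, and handling real incidents leads to better logging, tracing, and alerting. The article also stresses the value of being coached and shadowing experienced engineers before joining a rotation, and of fast, accurate communication during one. To keep skills sharp, it recommends weekly fire-drills to counter “operational underload” while also improving debugging and writing skills.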

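🫵 Responsibility and ownership: Being on-call gives developers a strong sense of responsibility and ownership for the code they deploy to production. Following the principle “you build it, you run it, you own it”, they write code more carefully to avoid being paged at night.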

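💡 Learning and growth: Production incidents are valuable learning opportunities. By troubleshooting them, developers learn how the system really behaves and improve logging, tracing, and alerting along the way, so they understand the system better and can respond effectively in the future.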

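🤝 Collaboration and communication: When handling a production incident, fast, accurate, and frequent communication is essential. The author encourages team members to work on incidents together during working hours and to keep everyone involved up to date, and to communicate just as transparently when an incident happens outside working hours.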

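🚀 Continuous improvement and practice: To counter “operational underload”, the author recommends weekly fire-drills. They keep on-call engineers in practice, help keep runbooks up to date, and build the ability to debug complex problems in production, which in turn builds confidence in handling emergencies.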

Being on-call: why, what, and how.

Why should you be on-call?

I’m assuming you’re on a software development team and you write software that gets deployed to production. I’m also assuming you have paying customers; otherwise, your on-call program wouldn’t exist in the first place. If all this is true, you should be on-call, at least for the stuff you write.

Responsibility

Being on-call gives you a sense of responsibility and accountability for the code you deploy to production. You build it, you run it, you own it. You know you need to be super careful with your code; otherwise, you’ll get paged at night. You don’t want that, and neither do I.

Learning

Incidents are the best learning platform: you have to debug your system, looking for clues. You get to truly know your system while running it in production. Meaningful logs and tracing let you track what your system is doing. Metrics and alerts let you track how your system is performing. If you have an incident in production, you will re-evaluate these things. You’ll come up with finer-grained logs and metrics that tell you even more about your system.
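To make the split between “what the system is doing” (logs) and “how it is performing” (metrics) concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `checkout` logger name, the `charge` handler, the `request_id` and `order_id` fields, and the in-process `METRICS` counter, which only stands in for whatever metrics client your stack actually uses.

```python
import json
import logging
from collections import Counter

# Hypothetical in-process counter standing in for a real metrics client.
METRICS = Counter()


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "request_id": getattr(record, "request_id", None),
            "order_id": getattr(record, "order_id", None),
        }
        return json.dumps(payload)


def build_logger(stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr if None)."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def charge(logger, request_id, order_id, amount_cents):
    """Hypothetical handler: the log says what happened, the counter how often."""
    context = {"request_id": request_id, "order_id": order_id}
    if amount_cents <= 0:
        METRICS["charge.rejected"] += 1
        logger.error("charge rejected: non-positive amount", extra=context)
        return False
    METRICS["charge.accepted"] += 1
    logger.info("charge accepted", extra=context)
    return True
```

During an incident, the JSON lines can be filtered by `request_id` to follow one request, while an alert could fire on the rate of `charge.rejected`. Which fields you attach is exactly the kind of thing you’ll revisit after a real incident.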

Money

You shouldn’t be on-call just for the money, but if your company pays a stipend for being on-call, it’s a win-win. At the end of the day, they pay you for writing great systems and for being responsible for them outside working hours.

What should you expect from being on-call?

I’ll tell you what your organization expects from you: “can you fix it?”.

You shouldn’t just land there

Even if you’re a seasoned engineer, you shouldn’t just land in an on-call rotation without first experiencing it through others. Each system is unique and has its own traits. It takes time to learn how to pinpoint possible root causes in an unfamiliar system, so, at first, you should observe.

Someone has to coach you for one or more rotations. You should try to shadow on-call engineers. You’ll get familiar with a given system if you receive the same alerts they do, watch them look for the root cause, and are the first to read the post-mortem.

After you get a sense of how things behave, you should be ready to start your first on-call rotation.

I call this observe, collect, act later.

How to perform during an on-call rotation?

You need to act fast, be accurate, and communicate a lot. During an incident, what feels like a lot of text is not a lot of text. Do yourself a favor and communicate a lot during an incident, even about small things.

You shouldn’t deal with incidents alone during working hours. Everyone should be able to contribute, and you must keep them posted.

You shouldn’t hide what you’re doing to solve a production incident. Everyone interested should be able to jump on a call with you. Having extra eyes on the problem will help you solve it. Additionally, you’ll get at least one extra pair of eyes to review the post-mortem. Everyone involved in the incident should contribute to it.

Make sure you behave the same way during incidents outside working hours. It may feel strange to write updates to yourself at 3 AM, but once your team is up, they will thank you for keeping them posted.

How can you improve your performance during an on-call rotation?

Weekly drills

If you belong to an on-call rotation, you must be able to solve incidents. If you have few incidents per rotation, how do you stay up to date with the system? Will you be able to solve a given incident two months from now without practicing? Google calls this “Operational Underload”.

Being on-call for a quiet system is blissful, but what happens if the system is too quiet or when SREs are not on-call often enough? An operational underload is undesirable for an SRE team. Being out of touch with production for long periods can lead to confidence issues, both in terms of overconfidence and underconfidence, while knowledge gaps are discovered only when an incident occurs.

You build confidence in your system and in your on-call rotation if you practice a lot beforehand. A great way to build that confidence is to take part in fire-drills. It shouldn’t be difficult to set them up every week to match each rotation. This way, everyone on-call gets to solve at least one incident per rotation.

Weekly fire-drills will help you keep your runbooks updated. If your organization lacks runbooks, fire-drills will get you started. Make sure you don’t skip adding meaningful logs and metrics after an incident or a fire-drill. All this combined will boost your confidence to solve incidents after hours on your own.
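One way to make “a drill every week to match each rotation” tangible is to generate the schedule up front. The sketch below assumes weekly rotations that start on a given date and a hand-written list of drill scenarios; the function name `plan_drills`, the engineer names, and the scenarios are all made up for illustration.

```python
from datetime import date, timedelta
from itertools import cycle


def plan_drills(engineers, scenarios, first_week, weeks):
    """Pair each weekly on-call rotation with one fire-drill scenario.

    Engineers and scenarios are both cycled, so everyone on-call gets
    a drill during their rotation and scenarios repeat once exhausted.
    """
    if not engineers or not scenarios:
        raise ValueError("need at least one engineer and one scenario")
    rotation = cycle(engineers)
    drills = cycle(scenarios)
    plan = []
    for week in range(weeks):
        plan.append({
            "week_of": first_week + timedelta(weeks=week),
            "on_call": next(rotation),
            "drill": next(drills),
        })
    return plan


if __name__ == "__main__":
    # Hypothetical team and scenarios.
    schedule = plan_drills(
        engineers=["ana", "bo"],
        scenarios=["database failover", "expired certificate", "full disk"],
        first_week=date(2024, 1, 1),
        weeks=4,
    )
    for entry in schedule:
        print(entry["week_of"], entry["on_call"], entry["drill"])
```

Using a different number of scenarios than engineers means each person eventually meets every scenario instead of repeating the same one.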

Debugging skills

Sometimes, your metrics and logs won’t tell you everything about your system during an incident. You’ll have to debug a given service in production. Make sure you practice this during the fire-drills; otherwise, you’ll struggle a bit. Know your stack from top to bottom and learn how to debug things in production. You won’t regret it.

Writing skills

Good communicators benefit the whole organization, during and after an incident. Make sure you communicate frequently and clearly during the incident. Ask others to review your post-mortems. Learn from the best contributors you know. Copy their writing style and keep improving. Everyone in the organization will thank you for being an excellent writer.

Keep practicing, keep writing 🖖
