How to Write Design Docs for Data Science/Machine Learning Projects

The purpose of a design doc is to help the author think deeply about the problem and the solution, and to get feedback. Writing one forces the author to research and understand the problem space and to organize their thinking clearly through the document. A design doc should cover why the problem matters, the goals, scope, assumptions, methodology, and implementation details, and should make success criteria and requirements explicit, both functional and non-functional. The methodology section details the problem statement, data, techniques, validation, and experimentation. The implementation section considers infrastructure, performance, security, data privacy, monitoring, cost, and integration points. A two-stage review process, a pre-review followed by a formal review, helps gather valuable feedback and improve the design.

💡 The core purpose of a design doc is to help the author think deeply about the problem and solution, and to invite feedback through the document. Writing is itself a process of thinking and organizing: it forces the author to understand the problem space fully and to express their ideas clearly and coherently.

📝 A design doc should follow the 'Why, What, How' framework: first explain why the problem is worth solving and why it matters, including business value, success criteria, scope of requirements, and assumptions; then detail the methodology, covering the problem statement, data, techniques, validation, and experimentation; finally describe the implementation, spanning infrastructure, performance, security, data privacy, monitoring, cost, and integration.

🔍 The methodology section should state the problem clearly, choose appropriate data and machine learning techniques, and detail data preparation, feature engineering, and model evaluation. It should also explain how offline validation and experiments will be run, and how A/B tests will be designed, including success metrics and guardrail metrics.

🛠️ The implementation section should address the system's non-functional requirements, such as infrastructure choices, performance (throughput and latency), security, data privacy, monitoring and alerting, cost estimates, and how the system integrates with other services. It should also assess potential risks and uncertainties and outline how to handle them.

🔄 Design docs are reviewed in two stages: a pre-review and a formal review. In the pre-review, a small group provides early feedback to help the author iterate and improve the design quickly; the formal review brings in more senior technical folk and decision-makers to ensure the design is complete and feasible.

This is part II of my answer to the top question on the topic poll (part I here): “How to write design documents for data science/machine learning projects?”

Also translated to Japanese (by Kaichi Nihira)

Design documents come in all shapes and sizes. But IMHO, they have the same purpose—to help the author think deeply about the problem and solution, and get feedback.

Thinking deeply comes with writing the design doc. To propose a good design, you have to research and understand the problem space. Then, communicating what you’ve learned via a document with different levels of detail forces you to clarify and organize your thoughts. Good writing does not come without good thinking.

“Full sentences are harder to write. They have verbs. The paragraphs have topic sentences. There is no way to write a six-page, narratively structured memo and not have clear thinking.” — Jeff Bezos

Distributing and getting feedback on design docs is also easier. They tend to be detailed, standalone documents that reviewers can read and provide comments on asynchronously. Contrast this to PowerPoint presentations which require a presenter and the audience in the same room (or now, in the same Zoom).

Is it a must to write a design doc? Of course not. But not writing one incurs the risk of building the wrong thing, or something that was requested but ends up unused. I’ve also observed costly projects halted due to design flaws discovered late in the project, because of an ill-defined problem statement or a tech choice that doesn’t scale. In hindsight, such waste could have been mitigated by investing time into writing and reviewing a design doc.

We’ll go over pointers on what to cover in design docs for machine learning systems—these pointers will guide the thinking process. My design docs tend to be structured via the Why, What, How framework shared last week (please skim it if you’ve not read it yet). Then, I’ll share how I get feedback via a two-step review process.

A simple template, available for the low price of free: ml-design-docs

The Why and What of design docs

A design doc should start by addressing the Whys and Whats.

Why should we solve this problem? Why now? Explain the motivation for your proposal and convince readers of its importance. What is the customer or business benefit? If you’re building a replacement system, explain why improvements to the existing system will not work as well. If there are alternatives, explain why your proposed system is better.

What are the success criteria? These are often framed as business goals, such as increased customer engagement, revenue, or reduced cost. They can also be framed as operational goals or new capabilities (e.g., ability to rollback models, serve features in real-time, etc.)

What are the requirements and constraints? Functional requirements are those that must be met to deliver the project. Describe them from the customer’s point of view—how will the customer experience it and/or benefit? Specific to machine learning, we’ll have specific requirements for each application, such as:

    - Recommendations: Proportion of items or customers with >5 recommended items
    - Fraud detection: Upper bound on the proportion or count of false positives
    - Automated classification: Threshold on the proportion or count of low-confidence predictions that require human review and approval
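To make the first requirement above concrete, here’s a minimal sketch of how it might be checked offline. The recommendations mapping and the five-item threshold are hypothetical stand-ins, not part of any particular system.

    # Sketch: check the proportion of customers with more than 5 recommended items.
    # `recommendations` maps a customer ID to their recommended item IDs (placeholder data).
    recommendations = {
        "cust_001": ["item_a", "item_b", "item_c", "item_d", "item_e", "item_f"],
        "cust_002": ["item_a", "item_b"],
        "cust_003": ["item_c", "item_d", "item_e", "item_f", "item_g", "item_h"],
    }

    MIN_ITEMS = 5  # functional requirement: customers should see > 5 recommendations
    covered = sum(1 for items in recommendations.values() if len(items) > MIN_ITEMS)
    coverage = covered / len(recommendations)
    print(f"Coverage: {coverage:.1%} of customers have > {MIN_ITEMS} recommended items")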

Non-functional/technical requirements define the quality of your system and determine how the system should be implemented. Usually, customers won’t notice them unless they’re not met (e.g., exceptionally high latency). Most systems will consider a similar set of requirements such as throughput, latency, security, data privacy, costs, etc.

What is in-scope vs out-of-scope? Some problems can be too big to solve all at once. To ship—and get feedback from customers—in a reasonable amount of time, we might need to chop it down to size. Be upfront about what’s out of scope. We might also need to take on tech debt to meet time and budget constraints. This is fine. Nonetheless, be deliberate about it and have a plan to pay off tech debt as soon as possible.

What are our assumptions? Make explicit your assumptions and understanding of the environment. For example, if building a recsys, how many products and users do you have? What is the expected number of requests per second? This guides how you frame the problem. It can be hard to apply reinforcement learning to large discrete action spaces (i.e., a large number of products) whereas simple approximate nearest neighbors scale well.

The How of design docs

Addressing the How in a design doc can look very different for each ML system. That said, here’s a list of things to consider in a design doc, split into two sections (methodology and implementation). These should serve as a checklist/reference and are not meant to be exhaustive. Remember, the aim of the design doc is to help you think and get feedback. Thus, write whatever is necessary to achieve this goal.

Methodology: How to solve problems with data and ML

This section is similar to the methods section in machine learning papers. A couple of key points I usually cover are:

Problem statement. Declare how you’ll frame the problem. In machine learning, the same problem can have vastly different approaches. If it’s a recommender system, are you taking a content-based or collaborative approach? Will it be an item-to-item or user-to-item recommender? Is your system focused on candidate generation or ranking? Being specific helps narrow down your search space and simplifies the rest of the design doc.

Also, be clear about the problem you’re solving. For example, recommendation systems often involve solving a surrogate problem—the Netflix Challenge assumes that accurately predicting user ratings leads to effective movie recommendations. Other surrogate labels include the probability of a video being played and the number of minutes watched. The choice of surrogate learning problem will have an outsized impact on A/B test results.

As another example, consider fraud detection. This can be solved via unsupervised or supervised approaches. An unsupervised approach won’t need labels and can adopt techniques such as outlier detection via isolation forests or identifying fraud networks via graph clustering. A supervised approach will need to consider label acquisition and how to balance between precision (more uncaught fraud) and recall (more false alarms).
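As a rough illustration of the unsupervised route, here’s a sketch using scikit-learn’s IsolationForest on synthetic transaction features. The feature set, contamination rate, and all numbers are assumptions for illustration, not recommendations.

    # Sketch: unsupervised fraud detection via outlier detection (isolation forest).
    # Synthetic features (amount, hour of day, transactions in past day) are placeholders.
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(42)
    normal = rng.normal(loc=[50, 14, 2], scale=[20, 4, 1], size=(1000, 3))
    fraud = rng.normal(loc=[900, 3, 15], scale=[200, 2, 5], size=(10, 3))
    X = np.vstack([normal, fraud])

    # contamination is the assumed fraction of outliers; in practice a tuning decision
    model = IsolationForest(contamination=0.01, random_state=42)
    labels = model.fit_predict(X)  # -1 = flagged as outlier, 1 = inlier
    print(f"Flagged {np.sum(labels == -1)} of {len(X)} transactions for review")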

Data. Describe the data and entities your ML model will be trained on. Commonly used data include customer (e.g., demographics), customer events (e.g., clicks, purchases), and items (e.g., metadata, text description, images). If you’re using customer data, pay attention to the aspects of data privacy and security (covered under implementation).

Techniques. Outline the machine learning techniques you’ll try/tried. Include baselines for comparison. This section may also include details on how you’ll clean and prepare the data, as well as your feature engineering approach. While not necessary, it’s a good idea to provide sufficient detail so that readers can implement/reproduce your work.
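For instance, listing a trivial baseline alongside your candidate techniques keeps comparisons honest. The popularity baseline below is only a sketch on made-up interaction data.

    # Sketch: a popularity baseline to compare candidate models against.
    from collections import Counter

    interactions = [  # (customer_id, item_id) pairs; placeholder data
        ("c1", "i1"), ("c2", "i1"), ("c3", "i2"), ("c1", "i3"), ("c2", "i1"),
    ]

    item_counts = Counter(item for _, item in interactions)
    popularity_baseline = [item for item, _ in item_counts.most_common(10)]
    print("Recommend to everyone:", popularity_baseline)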

Validation and experimentation. Explain how you’ll evaluate models offline. (IMHO, you won’t go wrong using a time-based split most of the time.) Note the difference between leave-one-last, temporal, random, and user-based splits. Explain your choice of evaluation metric(s) and why you think they are good proxy metrics for production conditions. If you’ve conducted experiments with validation results, include them.
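A temporal split can be as simple as cutting on a timestamp, as in the sketch below; the column names and cutoff date are made up for illustration.

    # Sketch: temporal (time-based) train/validation split, in contrast to a random split.
    import pandas as pd

    events = pd.DataFrame({
        "customer_id": ["c1", "c2", "c1", "c3", "c2"],
        "item_id": ["i1", "i2", "i3", "i1", "i4"],
        "timestamp": pd.to_datetime(
            ["2021-01-05", "2021-01-20", "2021-02-10", "2021-02-25", "2021-03-03"]
        ),
    })

    cutoff = pd.Timestamp("2021-02-15")  # train on the past, validate on the "future"
    train = events[events["timestamp"] < cutoff]
    valid = events[events["timestamp"] >= cutoff]
    print(len(train), "train rows,", len(valid), "validation rows")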

If you’re conducting an A/B test, specify if treatment and control groups will be split by customers or sessions. Indicate the metrics you’ll monitor and distinguish between success and guardrail metrics. Success metrics measure the extent of the desired outcome (e.g., increased clicks, conversion, etc.) Guardrail metrics protect the overall customer experience and prevent deterioration of the system—they ensure the outcome is at least neutral (to the customer) and cannot get worse no matter how success metrics improve. (As much as possible, the offline and online metrics should be correlated, but I’ve found this more of an art than science.)
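If splitting by customers, a common pattern is deterministic assignment by hashing the customer ID together with the experiment name, so a customer always lands in the same group. This sketch assumes a 50/50 split and a hypothetical experiment name.

    # Sketch: deterministic treatment/control assignment via hashing.
    import hashlib

    def assign_group(customer_id: str, experiment: str = "recsys_v2",
                     treatment_pct: float = 0.5) -> str:
        digest = hashlib.sha256(f"{experiment}:{customer_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
        return "treatment" if bucket < treatment_pct * 100 else "control"

    print(assign_group("cust_001"), assign_group("cust_002"))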

Human-in-the-loop. Indicate how human intervention can be incorporated into your system. I’ve had category managers implement rules to prevent certain product categories (e.g., adult toys, lingerie, weapons) from appearing on the home page. Conversely, customers might want to exclude themselves from recommendations (e.g., they get recommendations they don’t want seen on their home page). If it’s an automated fraud detection/loan approval system, we might also want dollar value thresholds that trigger mandatory human review and approval.
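Here’s a sketch of how such a review trigger might look; the dollar threshold and confidence cutoff are illustrative values only, not guidance.

    # Sketch: route high-value or low-confidence predictions to mandatory human review.
    DOLLAR_THRESHOLD = 10_000   # illustrative: amounts above this always get a human look
    CONFIDENCE_CUTOFF = 0.90    # illustrative: below this, the decision isn't auto-applied

    def route_decision(amount: float, model_score: float) -> str:
        if amount > DOLLAR_THRESHOLD or model_score < CONFIDENCE_CUTOFF:
            return "human_review"
        return "auto_approve"

    print(route_decision(amount=12_500, model_score=0.97))  # human_review (amount too high)
    print(route_decision(amount=800, model_score=0.95))     # auto_approve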

Implementation: How to build and operate the system

This section lists the non-functional/technical requirements and is more engineering-heavy; it’s not necessary to address all of them. If in doubt, consult engineers for help.

High-level design. It’s a good idea to start with a diagram providing a high-level view. System-context diagrams and data-flow diagrams work well. In ML systems, some key components are data stores, pipelines (e.g., data preparation, feature engineering, training), and serving. Show how components interact with one another. I often use data-flow diagrams to show how raw data is transformed and used to train models, as well as the input and output of my model in serving.

Infra + scalability. Briefly list the infra options and your final choice. Will it run on-premise, in the cloud, or a mix of both (e.g., data processing and training on-premise for data security, model serving in the cloud for scalability)? If you work in big tech with many different compute and hosting options, try to narrow down your search space early. Also, consider how your choice of infra will impact scalability—it’s easier to scale a cloud-based system than to add server racks.

Performance (throughput + latency). Address requirements on throughput (i.e., requests per second) and latency (e.g., x ms @ p99) and list how performance can be improved (e.g., pre-computation, caching). If additional throughput is required (e.g., to handle peak sales days), will you scale vertically (i.e., bigger machines) or horizontally (i.e., more machines of the same size)—your ability to do this will be tied to your choice of infra.
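As a toy example of verifying a latency requirement and of pre-computation via caching: the 50 ms p99 target, the simulated inference delay, and the cache size are assumptions for illustration.

    # Sketch: measure p99 latency against a target, and cache repeat recommendations.
    import time
    from functools import lru_cache

    import numpy as np

    @lru_cache(maxsize=10_000)          # caching: repeat requests skip the model call
    def recommend(customer_id: str) -> tuple:
        time.sleep(0.01)                # stand-in for model inference
        return ("item_a", "item_b", "item_c")

    latencies = []
    for customer in ["c1", "c2", "c1", "c3", "c1"]:
        start = time.perf_counter()
        recommend(customer)
        latencies.append((time.perf_counter() - start) * 1000)  # milliseconds

    p99 = np.percentile(latencies, 99)
    print(f"p99 latency: {p99:.2f} ms (target: 50 ms)")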

Security. Specify how you’ll secure your application and authenticate users and incoming requests. If your application endpoint is publicly accessible, you might want to plan for a denial-of-service attack. Organizations with centralized security teams might have an internal certification process that you can undergo to identify and patch risks.

Data privacy. Indicate how you’ll protect and ensure the privacy of customer data. Will your ML model learn on personally identifiable information (PII)? If so, detail how this PII will be stored, processed, and used in your model. Also, address how your system will comply with data retention and deletion policies such as GDPR. (I’ve built systems—in healthcare and human resources—where the PII was considered so sensitive that we declined to receive it, let alone use it.)
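One common mitigation, shown here only as a sketch, is to pseudonymize identifiers with a salted hash before data reaches the training pipeline. The salt handling below is deliberately simplified and is not a complete privacy solution.

    # Sketch: pseudonymize PII (e.g., email) with a salted hash before training.
    import hashlib
    import os

    SALT = os.environ.get("PII_SALT", "change-me")  # in practice, manage the salt as a secret

    def pseudonymize(value: str) -> str:
        return hashlib.sha256(f"{SALT}:{value}".encode()).hexdigest()

    record = {"email": "jane@example.com", "clicks": 12}
    record["email"] = pseudonymize(record["email"])
    print(record)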

Monitoring + alarms. Operating a system without monitoring is like driving at night without headlights—the lack of visibility is unnerving. Detail how you’ll monitor your system performance (e.g., throughput, latency, error rate, etc.) Monitoring can be done server-side (e.g., model endpoint) or client-side (e.g., consumer), with the latter including network latency. Also list the alarms that will trigger human intervention (e.g., on-call).
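A toy sketch of server-side monitoring: track the error rate over a rolling window and flag when it breaches a threshold. The 1% threshold and window size are made-up values.

    # Sketch: track rolling error rate and raise an alarm above a threshold.
    from collections import deque

    class ErrorRateMonitor:
        def __init__(self, window: int = 1000, threshold: float = 0.01):
            self.outcomes = deque(maxlen=window)  # True = request errored
            self.threshold = threshold

        def record(self, errored: bool) -> None:
            self.outcomes.append(errored)

        def should_alarm(self) -> bool:
            if not self.outcomes:
                return False
            error_rate = sum(self.outcomes) / len(self.outcomes)
            return error_rate > self.threshold

    monitor = ErrorRateMonitor()
    for errored in [False] * 980 + [True] * 20:   # simulate a 2% error rate
        monitor.record(errored)
    print("Page on-call:", monitor.should_alarm())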

Cost. This will be a key concern for decision-makers who hold the purse strings. It won’t make sense if the cost of operating your system exceeds the revenue it generates. This should include labour cost—how many engineers and scientists do you need to build the system, and for how long? If your system runs in the cloud, estimate the number of instances required for data processing (e.g., EMR clusters), and model training and serving (e.g., GPU instances, AWS Lambda).
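A back-of-envelope estimate can live right in the doc. The instance counts, hourly rates, and usage hours below are placeholders, not real prices.

    # Sketch: back-of-envelope monthly infra cost estimate (all numbers are placeholders).
    monthly_hours = 730

    components = {
        # name: (instance_count, hourly_rate_usd, hours_per_month)
        "data_processing_cluster": (5, 0.50, 120),             # batch jobs, a few hours a day
        "gpu_training":            (1, 3.00, 40),              # weekly retraining
        "serving_instances":       (3, 0.20, monthly_hours),   # always-on endpoints
    }

    total = sum(count * rate * hours for count, rate, hours in components.values())
    for name, (count, rate, hours) in components.items():
        print(f"{name:<26} ${count * rate * hours:,.0f}/month")
    print(f"{'total':<26} ${total:,.0f}/month")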

Integration points. Define how downstream services will use and interact with your endpoint. Share what the API specification looks like, and the expected input and output data. Keeping the API generic enough ensures extendability to other consuming services (i.e., higher adoption of your system).
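One lightweight way to pin down the contract in the doc itself is to sketch the request and response shapes. The field names below are illustrative, not an actual API.

    # Sketch: request/response contract for a recommendation endpoint (illustrative fields).
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class RecommendRequest:
        customer_id: str
        context: str = "home_page"   # keep the API generic so other surfaces can reuse it
        num_items: int = 10

    @dataclass
    class RecommendResponse:
        customer_id: str
        item_ids: List[str] = field(default_factory=list)
        model_version: str = "unknown"

    req = RecommendRequest(customer_id="cust_001")
    resp = RecommendResponse(customer_id=req.customer_id,
                             item_ids=["item_a", "item_b"], model_version="v1.2")
    print(resp)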

Risks and uncertainties. Risks are the known unknowns; uncertainties are the unknown unknowns. Call them out to the best of your ability. This allows reviewers to help spot design flaws and rabbit holes, and provide feedback on how to avoid/address them.

Other stuff. There’s a non-exhaustive list of other concerns that might be relevant to your system. This includes ops strategy (e.g., monitoring, on-call), model rollbacks, quality assurance, extensibility, and model footprint and power consumption (if used in mobile apps). Address them if they are key to your system.

Alternatives considered and rejected

It’s useful to include a section on alternatives you’ve considered but rejected. List their pros and cons as well as the rationale for your decision. Your decision will be based on your assumptions about the environment and the requirements, so it’s good to document it. If the environment changes, this section can help you reconsider past decisions.

This section helps you dive into the ambiguous, hidden choices and the implicit decisions made while designing your system. Being transparent allows others to check your blind spots and correct invalid assumptions. The aim is to suggest improvements to your design early, saving you from making bad or unnecessarily difficult design choices.

A disclaimer on design doc templates

By now, you might be looking for a template. I would too. Nonetheless, I’ve come to realize there’s no perfect template for design docs. The optimal structure and sections will differ for each ML system.

My experience has also made me hesitant about providing and using templates. Followed blindly, templates lead to authors playing fill-in-the-blanks. As a result, they overly focus on filling in the sections, and neglect aspects unique to their system or application. Conversely, some templates have redundant sections where you rehash information that your audience already knows (e.g., organization/team tech stack)—this is a waste of document real estate and your readers’ time.

Writing with the intent of thinking deeply and getting feedback helps with crafting a good design doc. Focus on what you and others need to know before implementing the system; share enough detail so others can provide meaningful feedback. Doing this usually leads to good design docs. Also, use the pointers above to guide you.

Nonetheless, if you're still looking for a template to guide your thinking, here's a minimal template that's deliberately lean.

Reviewing design docs in two stages

I find it helpful to conduct reviews in two stages: pre-review and review.

Pre-review involves quickly iterating and seeking feedback from a small group (often as part of the writing process). At this stage, the design doc might be a tad rough around the edges, with open questions and paths to explore. Nonetheless, my reviewers express a preference for being involved early as the raw and fluid state (of the design) allows them to provide feedback that meaningfully shapes the direction of the system. This is the stage where mentors and seniors can help to narrow the search space and simplify the design.

At this stage, the document will likely be low resolution and lacking in details—this is a feature, not a bug, and allows for quickly brainstorming and iterating through alternatives. Mine looks like an outline of the eventual design doc, with most of the details and feedback in bullet form. Much of it doesn’t make it into the final design doc.

I tend to conduct pre-reviews one-on-one in a casual setting, usually with individual team members or mentors. If you’re doing this, be clear that it’s the pre-review phase. I’ve caused unnecessary concern when a pre-reviewer thought he was reading the final design doc (when it was just the first iteration).

The review will be more formal and involve a larger audience of senior technical folk and decision-makers. Be clear what you want from the review. What risks/uncertainties need to be addressed? What decisions need to be made? What help do you need? If you’ve done your pre-review well, it shouldn’t be too common to make major design changes at this stage.

At this stage, the design doc should have the necessary details and be in structured prose. Quantified estimates (e.g., throughput, latency, cost) and offline experiment results (e.g., hit@10, nDCG) will be very helpful. Diagrams are a must. Questions asked by pre-reviewers can be addressed in the appendix via a FAQ section.

Scott wrote a post that includes suggestions on how to conduct meetings (at Amazon) such as “having the right people in the room” and “checking your ego at the door”. I think much of it applies to design doc reviews as well so I’ll refer you to his post.

Conclusion

Writing design docs is overhead. Minor changes (e.g., adding a feature column) or low-effort tasks (e.g., a few days) shouldn’t need a design doc—the cost of writing a full design doc will outweigh the benefits. Alternatively, prototyping can be a feasible approach for smaller systems.

Nonetheless, it can be useful to write a design doc when:

    - The problem and/or solution is ambiguous or not well understood (e.g., blockchain)
    - The impact is high (e.g., customer-facing, downstream impact on other services)
    - The implementation effort is high (e.g., multiple teams for a few months)

Whether you’re writing your first or 20th design doc for a machine learning system, I hope this write-up will be helpful for you. Did I miss anything? Reach out @eugeneyan!


Thanks to Yang Xinyi for reading drafts of this.

If you found this useful, please cite this write-up as:

Yan, Ziyou. (Mar 2021). How to Write Design Docs for Machine Learning Systems. eugeneyan.com. https://eugeneyan.com/writing/ml-design-docs/.

or

@article{yan2021design,
  title   = {How to Write Design Docs for Machine Learning Systems},
  author  = {Yan, Ziyou},
  journal = {eugeneyan.com},
  year    = {2021},
  month   = {Mar},
  url     = {https://eugeneyan.com/writing/ml-design-docs/}
}