https://eugeneyan.com/rss September 30
Building a Machine Learning System for Hospitals in Practice

 

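This article gives a behind-the-scenes look at building a machine learning system for Southeast Asia's largest hospital group. It walks through the full pipeline design, from data collection and encrypted transfer to model training, deployment, and monitoring. It focuses on the importance of time-based validation and on how learning domain knowledge, refining the data strategy, and sound engineering practices improved the model. It stresses that machine learning is only a small part of the system; methodology and engineering process matter more.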

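🏥 The system processes data in stages: hospitals store the raw data locally (patient, doctor, and case information), which is periodically encrypted and transferred to an SFTP server, decrypted and validated, placed in a staging area, then parsed and loaded into a database. The data science workflow reads from the database and runs preprocessing, feature engineering, model training, and validation. The final model is stored as a pickle file, in a PyTorch format, or as a Docker image, then versioned and deployed via the publishing server. Each hospital has its own model endpoint and records the predictions, which closes the loop.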

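⏳ Production systems should prefer time-based validation to avoid the data leakage that random splits can cause. In this case, many patients had multiple visits, so a random split could use future data to predict the past, whereas a time-based split reflects real-world performance more faithfully, especially for strongly temporal data such as healthcare.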

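👥 Spend time on the hospital front line: the author observed nursing and counter staff using the existing and new systems, gathered their working assumptions and heuristics, and turned them into new model features that reduced error. Consulting hospital administrators on how to clean the data improved accuracy further.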

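📈 More data and features are not always better: the author's attempt to pool all hospitals' data into a single model did not work, and switching to one model per hospital lowered error significantly. The hospitals differ in price point, specialization, and doctors, and a single model struggled to capture that heterogeneity.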

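🛠️ Engineering practices pay off: version control on model artifacts makes rollback easy, Docker encapsulates model endpoints and lets them scale horizontally, and the ELK stack gives visibility across the whole pipeline. These practices made deployment and operations much simpler and underline the importance of methodology and engineering process.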

This week, I had the opportunity to give a behind-the-scenes tour of building a machine learning system for Southeast Asia’s largest hospital group, at the DataTalks.Club conference. The video is available below. We’ll discuss a few key points in this post.


• • •

Overall design: From data to endpoint, and back to data

Design of the machine learning system, from data to model endpoint.

The grey dashed line (on the left) separates the hospitals’ environment from ours. Hospitals store the data, including patient, doctor, and case information. This data is periodically extracted, encrypted, and transported to our SFTP server. We then decrypt, validate, and store the raw, unstructured data (e.g., CSVs, parquet files, etc.) in our staging area. The raw data is then parsed and stored in our RDBMS.
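To make the "validate" step concrete, here is a minimal sketch of checking a decrypted CSV before it moves on from staging. The post does not describe the actual schema or checks, so the column names (`patient_id`, `doctor_id`, `case_id`, `admit_date`) and the two rules (no empty required field, ISO-formatted date) are assumptions for illustration only.

```python
import csv
import io
from datetime import date

# Hypothetical schema: the post does not list the actual columns.
REQUIRED_COLUMNS = ("patient_id", "doctor_id", "case_id", "admit_date")


def validate_rows(raw_text):
    """Split decrypted CSV text into (valid_rows, errors)."""
    reader = csv.DictReader(io.StringIO(raw_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    valid, errors = [], []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            if not all(row[c] for c in REQUIRED_COLUMNS):
                raise ValueError("empty required field")
            date.fromisoformat(row["admit_date"])  # reject malformed dates
            valid.append(row)
        except ValueError as exc:
            errors.append((line_no, str(exc)))
    return valid, errors


sample = (
    "patient_id,doctor_id,case_id,admit_date\n"
    "p1,d1,c1,2020-01-15\n"
    "p2,,c2,2020-01-16\n"
    "p3,d2,c3,16/01/2020\n"
)
valid_rows, bad_rows = validate_rows(sample)
print(len(valid_rows), "valid;", len(bad_rows), "rejected")
```

Collecting rejected rows with their line numbers, rather than failing on the first bad row, makes it easier to report problems back to the team that produced the extract.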

The data science workflow reads the data (from the tabular databases) and puts it through the pipeline, including data preparation, feature engineering, machine learning, and model validation. (For more details, please view the talk.)

The trained model is stored in our model store. Models can take various formats, such as pickle files, library-specific formats (e.g., PyTorch), or Docker images that include serving code. These models are then pushed to the publishing server, which versions and deploys the models. Model versioning lets us easily roll back model deployments, providing an insurance policy (in case a deployment goes awry).
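The idea of versioned model artifacts with rollback can be sketched with a toy file-based store. This is not the publishing server described above; the `ModelStore` class, the `v<N>.pkl` naming, and the `CURRENT` pointer file are all hypothetical, chosen only to show the mechanism.

```python
import pickle
import tempfile
from pathlib import Path


class ModelStore:
    """Toy versioned store: one pickle per version plus a 'current' pointer."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.pointer = self.root / "CURRENT"

    def versions(self):
        return sorted(int(p.stem[1:]) for p in self.root.glob("v*.pkl"))

    def current_version(self):
        return int(self.pointer.read_text()) if self.pointer.exists() else None

    def publish(self, model):
        """Save the model as the next version and make it live."""
        version = (self.versions() or [0])[-1] + 1
        with open(self.root / f"v{version}.pkl", "wb") as f:
            pickle.dump(model, f)
        self.pointer.write_text(str(version))
        return version

    def rollback(self):
        """Point 'current' at the newest version older than the live one."""
        live = self.current_version()
        older = [v for v in self.versions() if live is not None and v < live]
        if not older:
            raise RuntimeError("no earlier version to roll back to")
        self.pointer.write_text(str(older[-1]))
        return older[-1]

    def load_current(self):
        with open(self.root / f"v{self.current_version()}.pkl", "rb") as f:
            return pickle.load(f)


store = ModelStore(tempfile.mkdtemp())
store.publish({"name": "baseline", "coef": [0.5]})
store.publish({"name": "candidate", "coef": [0.9]})
store.rollback()  # the candidate misbehaves, so go back one version
print(store.current_version(), store.load_current()["name"])
```

Because old versions are never overwritten, rolling back only moves the pointer; nothing needs to be retrained or rebuilt.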

Each hospital has a unique model endpoint and stores the predictions received (we also log these predictions on our end). This completes the entire cycle and provides a feedback loop to validate and improve model performance.
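One way to keep predictions usable for that feedback loop is to log each one as a structured record that can later be joined with the actual outcome. The sketch below uses only the standard library; the field names (`hospital_id`, `model_version`, `case_id`) are assumptions, and the in-memory buffer stands in for a log file that a log shipper would tail.

```python
import io
import json
import logging
from datetime import datetime, timezone


def build_prediction_logger(stream):
    """Logger that writes one JSON document per line to the given stream."""
    logger = logging.getLogger("predictions")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def log_prediction(logger, hospital_id, model_version, case_id, prediction):
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "hospital_id": hospital_id,      # hypothetical identifiers
        "model_version": model_version,  # lets errors be traced to a model
        "case_id": case_id,
        "prediction": prediction,
    }
    logger.info(json.dumps(record))
    return record


buffer = io.StringIO()  # stands in for a log file that a shipper would tail
pred_logger = build_prediction_logger(buffer)
log_prediction(pred_logger, "hospital_a", 3, "c1", 1250.0)
logged = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(logged[0]["hospital_id"], logged[0]["prediction"])
```

Recording the model version next to each prediction is what allows a later error analysis to say which deployment produced which result.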

Logging and monitoring are done via the ELK stack, orchestration via Airflow, and endpoint publishing via Jenkins. Implementation was done in Python and deployed on Azure (a common cloud provider used by healthcare organizations in Southeast Asia).

Most prod systems should use time-based validation

The difference between a random-split and a time-based split.

In most production systems, our data will have a strong temporal aspect. Thus, using random train-test split or cross-validation will give an overly optimistic result. This was especially so in our data where many patients had more than one visit, and a random split might lead to using future hospital visits to predict past hospital visits. Thus, a time-based split should be used instead.
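A time-based split can be as simple as choosing a cutoff date. The sketch below uses made-up visit records and a hypothetical `time_based_split` helper; the cutoff is arbitrary.

```python
from datetime import date

# Made-up visits: (patient_id, visit_date). Patient "p1" has three visits.
visits = [
    ("p1", date(2020, 1, 5)),
    ("p2", date(2020, 2, 11)),
    ("p1", date(2020, 3, 20)),
    ("p3", date(2020, 4, 2)),
    ("p2", date(2020, 5, 9)),
    ("p1", date(2020, 6, 30)),
]


def time_based_split(rows, cutoff):
    """Train on everything before the cutoff, validate on the rest."""
    train = [r for r in rows if r[1] < cutoff]
    valid = [r for r in rows if r[1] >= cutoff]
    return train, valid


train, valid = time_based_split(visits, cutoff=date(2020, 5, 1))
print(len(train), "train visits;", len(valid), "validation visits")
```

Every training visit is earlier than every validation visit, which mirrors how the model is used in production: it only ever sees the past when predicting.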

There are exceptions to this though. For example, a product classifier which categorizes products based on image or text (e.g., title, description). The relationship between the images/text and product classification is likely stable over time. Thus, there probably isn’t much difference between a random-split and a time-based split.

Data leaks happen if we're not using a time-based validation split.
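The leak can be checked for directly: given a split, look for patients whose training visits are dated after one of their test visits. The sketch below runs that check on made-up data (50 patients, 4 visits each) for a random split and for a time-based split; `leaked_patients` is a hypothetical helper.

```python
import random
from datetime import date, timedelta


def leaked_patients(train, test):
    """Patients with a training visit dated after one of their test visits."""
    earliest_test = {}
    for patient, day in test:
        earliest_test[patient] = min(day, earliest_test.get(patient, day))
    return {p for p, day in train if p in earliest_test and day > earliest_test[p]}


# Made-up data: 50 patients with 4 visits each, 30 days apart.
start = date(2020, 1, 1)
visits = [(f"p{i}", start + timedelta(days=30 * k)) for i in range(50) for k in range(4)]

# Random 75/25 split.
rng = random.Random(0)
shuffled = visits[:]
rng.shuffle(shuffled)
cut = int(0.75 * len(shuffled))
random_leaks = leaked_patients(shuffled[:cut], shuffled[cut:])

# Time-based split: the last round of visits becomes the test set.
cutoff = start + timedelta(days=90)
time_leaks = leaked_patients(
    [v for v in visits if v[1] < cutoff], [v for v in visits if v[1] >= cutoff]
)
print(len(random_leaks), "patients leak in the random split;",
      len(time_leaks), "in the time-based split")
```

By construction the time-based split cannot leak in this sense, while the random split places later visits of some patients in training and earlier ones in test.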

Miscellaneous tips and key takeaways

During the Q&A, the conference organizer, Alexey, chimed in that the slides on miscellaneous tips and key takeaways were probably the most useful. Here’s a brief rundown:

Three simple bullet points that had outsized impact on overall outcomes.

Take time to learn from domain experts and users. While developing and improving our system, I visited the hospitals and observed the nursing and counter staff use their existing system, and eventually, our new system. This helped me learn about the assumptions and heuristics they adopted while using the systems, providing insight into new features (for our ML model) that reduced error. I also consulted hospital administrators for advice on how to better understand and clean the data—this also helped reduce error.

More data and/or features != better model. My initial approach was to use all the data from the various hospitals to build a single model—I assumed the model could learn about the idiosyncrasies of each hospital. This was not the case (or perhaps my models were underpowered). Building a separate model for each hospital worked much better (i.e., significantly lower error). In hindsight, this makes sense as each hospital was different in price point (e.g., 5-star hotel vs. public hospital), surgery and disease specialization, doctors, etc.
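The comparison between one pooled model and one model per hospital can be set up as below. This is only a sketch of the evaluation: the rows are made up, the two hospital labels are hypothetical, and a mean predictor stands in for a real model, so the numbers say nothing about the actual system.

```python
from collections import defaultdict
from statistics import mean

# Made-up (hospital, amount) rows; the two hospitals sit at different price points.
train = [("private", 900), ("private", 1100), ("private", 1000),
         ("public", 190), ("public", 210), ("public", 200)]
holdout = [("private", 1050), ("public", 180)]


def fit_mean(rows):
    """Stand-in 'model': predict the mean training target."""
    return mean(y for _, y in rows)


def mae(pairs):
    """Mean absolute error over (actual, predicted) pairs."""
    return mean(abs(y - pred) for y, pred in pairs)


# One pooled model for all hospitals.
pooled = fit_mean(train)
pooled_mae = mae([(y, pooled) for _, y in holdout])

# One model per hospital.
by_hospital = defaultdict(list)
for row in train:
    by_hospital[row[0]].append(row)
models = {h: fit_mean(rows) for h, rows in by_hospital.items()}
per_hospital_mae = mae([(y, models[h]) for h, y in holdout])

print("pooled MAE:", pooled_mae, "per-hospital MAE:", per_hospital_mae)
```

With real models the gap depends on how well the pooled model can represent hospital-level differences, so the comparison is worth running on a time-based holdout rather than assuming either outcome.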

Proper engineering practices make life easier. Beyond engineering practices in our code (e.g., test cases, linting, type checking, etc.), adopting version control on model artifacts let us roll back “defective” models if we ever needed to (we never did). Docker helped us encapsulate and horizontally scale our model endpoints, while the ELK stack provided much-needed visibility across our entire pipeline. All this made deploying and operating the ML system much easier.

Machine learning is <20% of the effort; the methodology and engineering process are more important. Many of the improvements to the system came from engaging with stakeholders to understand the problem, defining the right proxy metrics, and grokking the data. Also, as the architecture diagram shows, machine learning is really only a small part of the overall system. And I didn’t cover the topics of infra, networking, maintenance, monitoring, etc.

I shared a similar sentiment on Vicki Boykis’ question, and some people seemed to agree.

• • •

A common bottleneck I see in start-ups and SMEs is to focus too much on applying machine learning on batch data, usually CSVs, to achieve artificially strong results that are overfitted and not replicable in production. I hope this talk provided a greater understanding of how to avoid this, and develop a production-grade ML system.

If you found this useful, please cite this write-up as:

Yan, Ziyou. (Feb 2021). DataTalksClub - Building an ML System; Behind the Scenes. eugeneyan.com. https://eugeneyan.com/speaking/machine-learning-in-healthcare/.

or

@article{yan2021healthcare,
  title   = {DataTalksClub - Building an ML System; Behind the Scenes},
  author  = {Yan, Ziyou},
  journal = {eugeneyan.com},
  year    = {2021},
  month   = {Feb},
  url     = {https://eugeneyan.com/speaking/machine-learning-in-healthcare/}
}