AWS Machine Learning Blog, September 25
Integrating AWS Deep Learning Containers with SageMaker managed MLflow

Integrating AWS Deep Learning Containers (DLCs) with SageMaker managed MLflow gives organizations the flexibility of customized machine learning environments while enabling robust model governance. DLCs provide preconfigured Docker containers with frameworks such as TensorFlow and PyTorch, optimized for performance on AWS. SageMaker managed MLflow delivers automatic logging, enhanced comparison capabilities, and complete lineage tracking, simplifying ML lifecycle management. Built on AWS services such as Amazon EC2, Amazon S3, and Amazon ECR, the solution provides a scalable ML development environment that supports TensorFlow neural network training and experiment tracking while reducing resource consumption and improving efficiency.

💻 AWS Deep Learning Containers (DLCs) provide preconfigured Docker containers with frameworks such as TensorFlow and PyTorch, optimized for performance on AWS and able to meet specialized ML environment requirements.

📊 SageMaker managed MLflow offers automatic logging, enhanced comparison capabilities, and complete lineage tracking, simplifying ML lifecycle management and supporting model governance and auditability.

🔧 The solution uses AWS services such as Amazon EC2, S3, and ECR to build a scalable ML development environment that supports TensorFlow neural network training and experiment tracking while reducing resource consumption and improving efficiency.

📈 Integrating DLCs with MLflow balances infrastructure control with ML governance, supporting customized training environments while keeping model development standardized and traceable.

🌐 The approach suits data-intensive applications with specific hardware, software version, and security requirements, such as financial trading algorithms and healthcare data analysis, accelerating the path from experimentation to business impact.

Organizations building custom machine learning (ML) models often have specialized requirements that standard platforms can’t accommodate. For example, healthcare companies need specific environments to protect patient data while meeting HIPAA compliance, financial institutions require specific hardware configurations to optimize proprietary trading algorithms, and research teams need flexibility to experiment with cutting-edge techniques using custom frameworks. These specialized needs drive organizations to build custom training environments that give them control over hardware selection, software versions, and security configurations.

These custom environments provide the necessary flexibility, but they create significant challenges for ML lifecycle management. Organizations typically try to solve these problems by building additional custom tools, and some teams piece together various open source solutions. These approaches further increase operational costs and require engineering resources that could be better used elsewhere.

AWS Deep Learning Containers (DLCs) and managed MLflow on Amazon SageMaker AI offer a powerful solution that addresses both needs. DLCs provide preconfigured Docker containers with frameworks like TensorFlow and PyTorch, including NVIDIA CUDA drivers for GPU support. DLCs are optimized for performance on AWS, regularly maintained to include the latest framework versions and patches, and designed to integrate seamlessly with AWS services for training and inference. AWS Deep Learning AMIs (DLAMIs) are preconfigured Amazon Machine Images (AMIs) for Amazon Elastic Compute Cloud (Amazon EC2) instances. DLAMIs come with popular deep learning frameworks like PyTorch and TensorFlow, and are available for CPU-based instances and high-powered GPU-accelerated instances. They include NVIDIA CUDA, cuDNN, and other necessary tools, with AWS managing the updates of DLAMIs. Together, DLAMIs and DLCs provide ML practitioners with the infrastructure and tools to accelerate deep learning in the cloud at scale.

SageMaker managed MLflow delivers comprehensive lifecycle management with one-line automatic logging, enhanced comparison capabilities, and complete lineage tracking. As a fully managed service on SageMaker AI, it alleviates the operational burden of maintaining tracking infrastructure.
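As a minimal sketch of that setup (assuming the sagemaker-mlflow plugin is installed and a tracking server already exists; the ARN below is a placeholder for your own server), a training script can point MLflow at the managed tracking server and turn on automatic logging in one line:

import mlflow

# Placeholder ARN for your SageMaker managed MLflow tracking server
TRACKING_SERVER_ARN = "arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<server-name>"

mlflow.set_tracking_uri(TRACKING_SERVER_ARN)             # send runs to the managed tracking server
mlflow.set_experiment("abalone-tensorflow-experiment")   # experiment name used later in this post

# One-line automatic logging: captures parameters, per-epoch metrics,
# and the trained model for subsequent tf.keras training runs
mlflow.tensorflow.autolog()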

In this post, we show how to integrate AWS DLCs with MLflow to create a solution that balances infrastructure control with robust ML governance. We walk through a functional setup that your team can use to meet your specialized requirements while significantly reducing the time and resources needed for ML lifecycle management.

Solution overview

In this section, we describe the architecture and AWS services used to integrate AWS DLCs with SageMaker managed MLflow. The solution brings several AWS services together to create a scalable environment for ML development.

For this use case, you will develop a TensorFlow neural network model for abalone age prediction with integrated SageMaker managed MLflow tracking code. Next, you will pull an optimized TensorFlow training container from the AWS public ECR repository and configure an EC2 instance with access to the MLflow tracking server. You will then execute the training process within the DLC while storing model artifacts in Amazon S3 and logging experiment results to MLflow. Finally, you will view and compare experiment results in the MLflow UI to evaluate model performance.

The following diagram shows the interaction between the AWS services, AWS DLCs, and SageMaker managed MLflow that make up the solution.

The workflow consists of the following steps:

    1. Develop a TensorFlow neural network model for abalone age prediction.
    2. Integrate SageMaker managed MLflow tracking within the model code to log parameters, metrics, and artifacts (a sketch of this integration follows the list).
    3. Pull an optimized TensorFlow training container from the AWS public ECR repository.
    4. Configure Amazon EC2 and the DLAMI with access to the MLflow tracking server using an AWS Identity and Access Management (IAM) role for EC2.
    5. Execute the training process within the DLC running on Amazon EC2, store model artifacts in Amazon S3, and log all experiment results and register the model in MLflow.
    6. Compare experiment results through the MLflow UI.
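To make these steps concrete, the following sketch shows roughly what the MLflow-instrumented training script can look like. It is illustrative only: synthetic placeholder data stands in for the prepared abalone features, and the run name, hyperparameters, and network shape are assumptions rather than the exact code from the repository.

import mlflow
import numpy as np
import tensorflow as tf

# Placeholder data standing in for the prepared abalone features (8 inputs) and ring counts
X_train, y_train = np.random.rand(1000, 8).astype("float32"), np.random.rand(1000).astype("float32")
X_test, y_test = np.random.rand(200, 8).astype("float32"), np.random.rand(200).astype("float32")

params = {"epochs": 50, "batch_size": 32, "learning_rate": 1e-3}

with mlflow.start_run(run_name="abalone-dlc-training"):
    mlflow.log_params(params)  # record hyperparameters so runs can be compared later

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),  # regression output: predicted ring count (age proxy)
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(params["learning_rate"]),
        loss="mse",
        metrics=["mae"],
    )

    model.fit(
        X_train, y_train,
        epochs=params["epochs"],
        batch_size=params["batch_size"],
        validation_split=0.2,
    )

    test_mse, test_mae = model.evaluate(X_test, y_test)
    mlflow.log_metric("test_mse", test_mse)  # final evaluation metrics logged to the run
    mlflow.log_metric("test_mae", test_mae)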

Prerequisites

To follow along with this walkthrough, make sure you have the following prerequisites:

Deploy the solution

Detailed step-by-step instructions are available in the accompanying GitHub repository’s README file. The walkthrough covers the entire workflow—from provisioning infrastructure and setting up permissions to executing your first training job with comprehensive experiment tracking.

Analyze experiment results

After you’ve implemented the solution following the steps in the README file, you can access and analyze your experiment results. The following screenshots demonstrate how SageMaker managed MLflow provides comprehensive experiment tracking, model governance, and auditability for your deep learning workloads. When training is complete, all experiment metrics, parameters, and artifacts are automatically captured in MLflow, providing a central location to track and compare your model development journey. The following screenshot shows the experiment abalone-tensorflow-experiment with a run named unique-cod-104. This dashboard gives you a complete overview of all experiment runs, so you can compare different approaches and model iterations at a glance.
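The same comparison can also be done programmatically against the tracking server. The following sketch (the experiment name comes from the screenshot; the metric column is an assumption about what was logged) pulls the runs into a DataFrame sorted by validation loss:

import mlflow

# Query all runs in the experiment and rank them by validation loss (assumed metric name)
runs = mlflow.search_runs(
    experiment_names=["abalone-tensorflow-experiment"],
    order_by=["metrics.val_loss ASC"],
)
print(runs[["run_id", "tags.mlflow.runName", "metrics.val_loss"]].head())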

The following screenshot shows the detailed information for run unique-cod-104, including the registered model abalone-tensorflow-custom-callback-model (version v2). This view provides critical information about model provenance, showing exactly which experiment run produced which model version, which is a key component of model governance.

The following visualization tracks the training loss across epochs, captured using a custom callback. Such metrics help you understand model convergence patterns and evaluate training performance, giving insights into potential optimization opportunities.
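A minimal version of such a callback might look like the following (the class name and metric keys are illustrative, not the exact implementation from the repository):

import mlflow
import tensorflow as tf

class MLflowLossCallback(tf.keras.callbacks.Callback):
    """Log training and validation loss to MLflow at the end of every epoch."""

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        if "loss" in logs:
            mlflow.log_metric("train_loss", logs["loss"], step=epoch)
        if "val_loss" in logs:
            mlflow.log_metric("val_loss", logs["val_loss"], step=epoch)

# Passed to model.fit(..., callbacks=[MLflowLossCallback()]) inside the active MLflow run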

The registered models view illustrated in the following screenshot shows how abalone-tensorflow-custom-callback-model is tracked in the model registry. This integration enables versioning, model lifecycle management, and deployment tracking.

The following screenshot illustrates one of the solution’s most powerful governance features. When logging a model with mlflow.tensorflow.log_model() using the registered_model_name parameter, the model is automatically registered in the Amazon SageMaker Model Registry. This creates full traceability from experiment to deployed model, establishing an audit trail that connects your training runs directly to production models.
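In code, that registration is a single call. Continuing from the training sketch above, where model is the trained Keras model, it can look like this (the artifact path is an assumption; the registered model name matches the screenshots):

import mlflow

# Log the trained Keras model and register it in the model registry in one step;
# registered_model_name triggers automatic registration and versioning
mlflow.tensorflow.log_model(
    model,
    artifact_path="model",  # assumed artifact path within the run
    registered_model_name="abalone-tensorflow-custom-callback-model",
)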

This seamless integration between your custom training environments and SageMaker governance tools helps you maintain visibility and compliance throughout your ML lifecycle.

The model artifacts are automatically uploaded to Amazon S3 after training completion, as illustrated in the following screenshot. This organized storage structure makes sure that all model components, including weights, configurations, and associated metadata, are securely preserved and accessible through a standardized path.
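If you need those artifacts outside the console, they can also be fetched through the MLflow client rather than by navigating S3 paths directly. A sketch (the run ID and destination directory are placeholders):

import mlflow

# Download every artifact logged for a given run (stored under the tracking
# server's S3 artifact location) into a local directory
local_path = mlflow.artifacts.download_artifacts(
    run_id="<your-run-id>",            # placeholder: the run shown in the MLflow UI
    dst_path="./downloaded-artifacts",
)
print(f"Artifacts downloaded to {local_path}")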

Cost implications

The following resources incur costs. Refer to the respective pricing page to estimate costs.

Clean up

Complete the following steps to clean up your resources:

    Delete the MLflow tracking server, because it continues to incur costs as long as it’s running:
aws sagemaker delete-mlflow-tracking-server \
    --tracking-server-name <your-tracking-server-name>
    Stop the EC2 instance to avoid incurring additional costs:
aws ec2 stop-instances --instance-ids <your-instance-id>
    Remove training data, model artifacts, and MLflow experiment data from S3 buckets:
aws s3 rm s3://<your-bucket>/<your-MLflow-folder> --recursive
    Review and clean up any temporary IAM roles created for the EC2 instance.
    Delete your SageMaker Studio domain.

Conclusion

AWS DLCs and SageMaker managed MLflow provide ML teams a solution that balances the trade-off between governance and flexibility. This integration helps data scientists seamlessly track experiments and deploy models for inference, and helps administrators establish secure, scalable SageMaker managed MLflow environments. Organizations can now standardize their ML workflows using either AWS DLCs or DLAMIs while accommodating specialized requirements, ultimately accelerating the journey from model experimentation to business impact with greater control and efficiency.

In this post, we explored how to integrate custom training environments with SageMaker managed MLflow to gain comprehensive experiment tracking and model governance. This approach maintains the flexibility of your preferred development environment while benefiting from centralized tracking, model registration, and lineage tracking. The integration provides a perfect balance between customization and standardization, so teams can innovate while maintaining governance best practices.

Now that you understand how to track training in DLCs with SageMaker managed MLflow, you can implement this solution in your own environment. All code examples and implementation details from this post are available in our GitHub repository.

For more information, refer to the following resources:


About the authors

Gunjan Jain, an AWS Solutions Architect based in Southern California, specializes in guiding large financial services companies through their cloud transformation journeys. He expertly facilitates cloud adoption, optimization, and implementation of Well-Architected best practices. Gunjan’s professional focus extends to machine learning and cloud resilience, areas where he demonstrates particular enthusiasm. Outside of his professional commitments, he finds balance by spending time in nature.

Rahul Easwar is a Senior Product Manager at AWS, leading managed MLflow and Partner AI Apps within the SageMaker AIOps team. With over 15 years of experience spanning startups to enterprise technology, he leverages his entrepreneurial background and MBA from Chicago Booth to build scalable ML platforms that simplify AI adoption for organizations worldwide. Connect with Rahul on LinkedIn to learn more about his work in ML platforms and enterprise AI solutions.
