AWS Machine Learning Blog | September 23, 01:58
SageMaker and Comet integration powers enterprise ML model development and management

This post introduces the integration of Amazon SageMaker with the Comet platform, designed to help enterprises tackle the complexities of experiment management, model lineage, and reproducibility as they scale machine learning (ML) projects. With Comet available as a Partner AI App in SageMaker AI, enterprises gain end-to-end support for their ML workflows: SageMaker provisions the infrastructure and compute resources, while Comet focuses on experiment tracking, model registry, and production monitoring, keeping model development compliant and efficient under increasingly strict AI regulation. The post walks through how administrators and users set up and use Comet in SageMaker AI, and uses a fraud detection use case to show how the integration automates and makes traceable the full cycle of data versioning, model training, evaluation, and deployment.

🚀 Seamless integration and deployment: SageMaker AI serves as the managed infrastructure, providing compute resources and distributed training capabilities, while Comet, as a SageMaker AI Partner AI App, delivers comprehensive ML experiment management. The integration lets enterprises manage the end-to-end ML workflow, from model development to production deployment, on one unified platform, reducing the complexity of cross-team collaboration and environment configuration.

📊 Automated experiment tracking and reproducibility: Comet automatically captures code, dependencies, system metadata, hyperparameters, and other key information from ML experiments, laying a solid foundation for model reproducibility. Through its Artifact feature, Comet also versions datasets and tracks their lineage automatically, ensuring every model is trained on clearly traceable data, which is critical for meeting compliance requirements.

📈 End-to-end visualization and collaboration: The integrated platform provides powerful visualization tools that make it easy to compare performance metrics across experiments, with support for custom panels for deeper analysis. This accelerates model optimization and fosters effective collaboration among team members on complex ML problems, improving overall development efficiency.

🔒 Enterprise-grade compliance and governance: By combining SageMaker's secure ML environments with Comet's complete model lineage and data provenance capabilities, the integration helps enterprises scale ML projects in regulated environments. Detailed audit logs and model lifecycle tracking keep the development process transparent and compliant, avoiding the costly mistakes that arise when results cannot be reproduced.

This post was written with Sarah Ostermeier from Comet.

As enterprise organizations scale their machine learning (ML) initiatives from proof of concept to production, the complexity of managing experiments, tracking model lineage, and ensuring reproducibility grows exponentially. This is primarily because data scientists and ML engineers constantly explore different combinations of hyperparameters, model architectures, and dataset versions, generating massive amounts of metadata that must be tracked for reproducibility and compliance. As ML model development scales across multiple teams and regulatory requirements intensify, tracking experiments becomes even more complex. With increasing AI regulation, particularly in the EU, organizations now require detailed audit trails of model training data, performance expectations, and development processes, making experiment tracking a business necessity rather than just a best practice.

Amazon SageMaker AI provides the managed infrastructure enterprises need to scale ML workloads, handling compute provisioning, distributed training, and deployment without infrastructure overhead. However, teams still need robust experiment tracking, model comparison, and collaboration capabilities that go beyond basic logging.

Comet is a comprehensive ML experiment management platform that automatically tracks, compares, and optimizes ML experiments across the entire model lifecycle. It provides data scientists and ML engineers with powerful tools for experiment tracking, model monitoring, hyperparameter optimization, and collaborative model development. It also offers Opik, Comet’s open source platform for LLM observability and development.

Comet is available in SageMaker AI as a Partner AI App: a fully managed experiment management capability with enterprise-grade security, seamless workflow integration, and a straightforward procurement process through AWS Marketplace.

The combination addresses the needs of an enterprise ML workflow end-to-end, where SageMaker AI handles infrastructure and compute, and Comet provides the experiment management, model registry, and production monitoring capabilities that teams require for regulatory compliance and operational efficiency. In this post, we demonstrate a complete fraud detection workflow using SageMaker AI with Comet, showcasing reproducibility and audit-ready logging needed by enterprises today.

Enterprise-ready Comet on SageMaker AI

Before proceeding to setup instructions, organizations must identify their operating model and, based on that, decide how Comet will be set up. We recommend implementing Comet using a federated operating model. In this architecture, Comet is centrally managed and hosted in a shared services account, and each data science team maintains a fully autonomous environment. Each operating model comes with its own set of benefits and limitations. For more information, refer to SageMaker Studio Administration Best Practices.

Let’s dive into the setup of Comet in SageMaker AI. Large enterprises generally have the following personas:

    Administrator – Provisions and manages the Comet subscription and the SageMaker AI environment for the use case team
    User – A data scientist or ML engineer who builds, trains, and tracks models in the provisioned environment

In the following sections, we go through each persona’s journey.

Comet works well with both Amazon SageMaker AI and the next generation of Amazon SageMaker. SageMaker AI provides the Amazon SageMaker Studio integrated development environment (IDE), and the next generation of SageMaker provides the Amazon SageMaker Unified Studio IDE. For this post, we use SageMaker Studio.

Administrator journey

In this scenario, the administrator receives a request from a team working on a fraud detection use case to provision an ML environment with a fully managed training and experimentation setup. The administrator’s journey includes the following steps:

    Follow the prerequisites to set up Partner AI Apps. This sets up permissions for administrators, allows Comet to assume a SageMaker AI execution role on behalf of users, and grants additional privileges for managing the Comet subscription through AWS Marketplace. On the SageMaker AI console, under Applications and IDEs in the navigation pane, choose Partner AI Apps, then choose View details for Comet.

The details page shows the contract pricing model for Comet and the estimated costs for each infrastructure tier.

Comet provides subscription options ranging from 1-month to 36-month contracts. With this contract, users can access Comet in SageMaker. Based on the number of users, the admin can review and choose an appropriate instance size for the Comet dashboard server. Comet supports 5–500 users running more than 100 experiment jobs.

    Choose Go to Marketplace to subscribe; you are redirected to the Comet listing on AWS Marketplace. Choose View purchase options.

    In the subscription form, provide the required details.

When the subscription is complete, the admin can start configuring Comet.

    While deploying Comet, add the project lead of the fraud detection use case team as an admin to manage the admin operations for the Comet dashboard.

It takes a few minutes for the Comet server to be deployed. For more details on this step, refer to Partner AI App provisioning.

    Set up a SageMaker AI domain following the steps in Use custom setup for Amazon SageMaker AI. As a best practice, provide a pre-signed domain URL for the use case team member to directly access the Comet UI without logging in to the SageMaker console. Add the team members to this domain and enable access to Comet while configuring the domain.

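For example, the admin could generate the pre-signed URL with the SageMaker API. The following is a minimal sketch using boto3; the domain ID and user profile name are placeholders:

import boto3

sagemaker_client = boto3.client("sagemaker")

# Generate a pre-signed SageMaker Studio URL for a domain user (placeholder IDs)
response = sagemaker_client.create_presigned_domain_url(
    DomainId="d-xxxxxxxxxxxx",
    UserProfileName="fraud-team-user",
    SessionExpirationDurationInSeconds=43200,  # How long the Studio session lasts
)
print(response["AuthorizedUrl"])  # Share this URL with the team member
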
Now the SageMaker AI domain is ready for users to log in to and start working on the fraud detection use case.

User journey

Now let’s explore the journey of an ML practitioner from the fraud detection use case. The user completes the following steps:

    Log in to the SageMaker AI domain through the pre-signed URL.

You will be redirected to the SageMaker Studio IDE. Your user name and AWS Identity and Access Management (IAM) execution role are preconfigured by the admin.

    Create a JupyterLab Space following the JupyterLab user guide. You can start working on the fraud detection use case by spinning up a Jupyter notebook.

The admin has also set up required access to the data through an Amazon Simple Storage Service (Amazon S3) bucket.

    To access Comet APIs, install the comet_ml library and configure the required environment variables as described in Set up the Amazon SageMaker Partner AI Apps SDKs. To access the Comet UI, choose Partner AI Apps in the SageMaker Studio navigation pane and choose Open for Comet.

Now, let’s walk through the use case implementation.

Solution overview

This use case highlights common enterprise challenges: working with imbalanced datasets (in this example, only 0.17% of transactions are fraudulent), requiring multiple experiment iterations, and maintaining full reproducibility for regulatory compliance. To follow along, refer to the Comet documentation and Quickstart guide for additional setup and API details.

For this use case, we use the Credit Card Fraud Detection dataset. The dataset contains credit card transactions with binary labels representing fraudulent (1) or legitimate (0) transactions. In the following sections, we walk through some of the important sections of the implementation. The entire code of the implementation is available in the GitHub repository.

Prerequisites

As a prerequisite, configure the necessary imports and environment variables for the Comet and SageMaker integration:

# Comet ML for experiment tracking
import comet_ml
from comet_ml import Experiment, API, Artifact
from comet_ml.integration.sagemaker import log_sagemaker_training_job_v1

# Environment variables for the Partner AI App integration
AWS_PARTNER_APP_AUTH=true
AWS_PARTNER_APP_ARN=<Your_AWS_PARTNER_APP_ARN>
# From the details page, choose Open Comet; in the top-right corner,
# choose your user name, then API Key
COMET_API_KEY=<Your_Comet_API_Key>

# Comet ML configuration
COMET_WORKSPACE = '<your-comet-workspace-name>'
COMET_PROJECT_NAME = '<your-comet-project-name>'
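
The first three values above are environment variables rather than Python assignments. One way to set them from inside a notebook, as a minimal sketch, is through os.environ before comet_ml reads them:

import os

# Set the Partner AI App variables before configuring comet_ml
os.environ["AWS_PARTNER_APP_AUTH"] = "true"
os.environ["AWS_PARTNER_APP_ARN"] = "<Your_AWS_PARTNER_APP_ARN>"
os.environ["COMET_API_KEY"] = "<Your_Comet_API_Key>"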

Prepare the dataset

One of Comet’s key enterprise features is automatic dataset versioning and lineage tracking. This capability provides full auditability of what data was used to train each model, which is critical for regulatory compliance and reproducibility. Start by loading the dataset:

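The notebook first reads the raw CSV so it can compute summary statistics. The following is a minimal sketch of that step; the file name and the fraud_percentage computation (reused later in artifact metadata) are assumptions:

import pandas as pd

# s3_data_path points to the raw dataset uploaded to Amazon S3
# (hypothetical key; reading s3:// paths with pandas requires the s3fs package)
s3_data_path = "s3://<your-bucket>/raw/creditcard.csv"
df = pd.read_csv(s3_data_path)

# The dataset is heavily imbalanced: ~0.17% of transactions are fraudulent
fraud_percentage = 100 * df["Class"].mean()
print(f"Fraudulent transactions: {fraud_percentage:.3f}%")
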
# Create a Comet Artifact to track our raw dataset
dataset_artifact = Artifact(
    name="fraud-dataset",
    artifact_type="dataset",
    aliases=["raw"]
)

# Add the raw dataset file to the artifact as a remote (S3) asset
dataset_artifact.add_remote(s3_data_path, metadata={
    "dataset_stage": "raw",
    "dataset_split": "not_split",
    "preprocessing": "none"
})

Start a Comet experiment

With the dataset artifact created, you can now start tracking the ML workflow. Creating a Comet experiment automatically begins capturing code, installed libraries, system metadata, and other contextual information in the background. You can log the dataset artifact created earlier in the experiment. See the following code:

# Create a new Comet experiment
experiment_1 = comet_ml.Experiment(
    project_name=COMET_PROJECT_NAME,
    workspace=COMET_WORKSPACE,
)

# Log the dataset artifact to this experiment for lineage tracking
experiment_1.log_artifact(dataset_artifact)

Preprocess the data

Next come standard preprocessing steps: removing duplicates, dropping unneeded columns, splitting into train/validation/test sets, and standardizing features using scikit-learn's StandardScaler. We wrap the processing code in preprocess.py and run it as a SageMaker Processing job. See the following code:

# Run SageMaker processing job
processor = SKLearnProcessor(
    framework_version='1.0-1',
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.t3.medium'
)
processor.run(
    code='preprocess.py',
    inputs=[ProcessingInput(source=s3_data_path, destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output', destination=f's3://{bucket_name}/{processed_data_prefix}')]
)
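
The post doesn't reproduce preprocess.py itself; the following is a minimal sketch of what such a script might contain, assuming the public dataset's schema (features V1-V28, Amount, a Time column to drop, and a Class label) and hypothetical output file names:

import os

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

input_dir = "/opt/ml/processing/input"
output_dir = "/opt/ml/processing/output"

# Load the raw data mounted by the processing job (file name is an assumption)
df = pd.read_csv(os.path.join(input_dir, "creditcard.csv"))

# Remove duplicates and drop the unneeded Time column
df = df.drop_duplicates().drop(columns=["Time"])

# Stratified train/validation/test split preserves the ~0.17% fraud rate
X, y = df.drop(columns=["Class"]), df["Class"]
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

# Standardize features, fitting the scaler on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# SageMaker's built-in XGBoost expects CSVs with the label first and no header
os.makedirs(output_dir, exist_ok=True)
for name, Xs, ys in [("train", X_train, y_train),
                     ("validation", X_val, y_val),
                     ("test", X_test, y_test)]:
    out = pd.concat([ys.reset_index(drop=True),
                     pd.DataFrame(Xs, columns=X.columns)], axis=1)
    out.to_csv(os.path.join(output_dir, f"{name}.csv"), index=False, header=False)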

After you submit the processing job, SageMaker AI launches the compute instances, processes and analyzes the input data, and releases the resources upon completion. The output of the processing job is stored in the S3 bucket specified.

Next, create a new version of the dataset artifact to track the processed data. Comet automatically versions artifacts with the same name, maintaining complete lineage from raw to processed data.

# Create an updated version of the 'fraud-dataset' Artifact for the preprocessed data
preprocessed_dataset_artifact = Artifact(
    name="fraud-dataset",
    artifact_type="dataset",
    aliases=["preprocessed"],
    metadata={
        "description": "Credit card fraud detection dataset",
        "fraud_percentage": f"{fraud_percentage:.3f}%",
        "dataset_stage": "preprocessed",
        "preprocessing": "StandardScaler + train/val/test split",
    }
)

# Add our train, validation, and test dataset files as remote assets
preprocessed_dataset_artifact.add_remote(
    uri=f's3://{bucket_name}/{processed_data_prefix}',
    logical_path='split_data'
)

# Log the updated dataset to the experiment to track the updates
experiment_1.log_artifact(preprocessed_dataset_artifact)
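
On the consumption side, any later experiment can pull a specific version of this dataset by alias, which is how lineage stays auditable end to end. A minimal sketch using the comet_ml artifact API:

# In a downstream experiment, fetch the dataset version you need by alias
experiment = comet_ml.Experiment(
    project_name=COMET_PROJECT_NAME,
    workspace=COMET_WORKSPACE,
)
logged_artifact = experiment.get_artifact("fraud-dataset", version_or_alias="preprocessed")
print(logged_artifact.version)  # Resolved semantic version, e.g. 2.0.0

# Remote (S3) assets are tracked by reference on the artifact
for asset in logged_artifact.assets:
    print(asset.logical_path)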

The Comet and SageMaker AI experiment workflow

Data scientists prefer rapid experimentation; therefore, we organized the workflow into reusable utility functions that can be called multiple times with different hyperparameters while maintaining consistent logging and evaluation across all runs. In this section, we showcase the utility functions along with brief snippets of the code inside them:

# Inside the train() utility: create the SageMaker estimator
estimator = Estimator(
    image_uri=xgboost_image,
    role=execution_role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=model_output_path,
    sagemaker_session=sagemaker_session_obj,
    hyperparameters=hyperparameters_dict,
    max_run=1800  # Maximum training time in seconds
)

# Start training
estimator.fit({
    'train': train_channel,
    'validation': val_channel
})

# Inside the log_training_job() utility: log the SageMaker training job to Comet
log_sagemaker_training_job_v1(
    estimator=training_estimator,
    experiment=api_experiment
)

# Inside the log_model_to_comet() utility: track the model artifact in Comet
experiment.log_remote_model(
    model_name=model_name,
    uri=model_artifact_path,
    metadata=metadata
)

# Inside the deploy_and_evaluate_model() utility: deploy to an endpoint
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge"
)

# Log metrics and visualizations to Comet
experiment.log_metrics(metrics)
experiment.log_confusion_matrix(matrix=cm, labels=['Normal', 'Fraud'])

# Log ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_prob_as_np_array)
experiment.log_curve("roc_curve", x=fpr, y=tpr)
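
For context, the metrics and cm objects logged above can be derived from the endpoint's predictions with scikit-learn. The following is a small sketch; the variable names follow the snippets above, and the 0.5 decision threshold is an assumption:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# y_pred_prob_as_np_array holds the endpoint's fraud probabilities for X_test_scaled
y_pred = (y_pred_prob_as_np_array >= 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_pred_prob_as_np_array),
}
cm = confusion_matrix(y_test, y_pred)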

The complete prediction and evaluation code is available in the GitHub repository.

Run the experiments

Now you can run multiple experiments by calling the utility functions with different configurations, then compare the experiments to find the optimal settings for the fraud detection use case.

For the first experiment, we establish a baseline using standard XGBoost hyperparameters:

# Define hyperparameters for first experiment
hyperparameters_v1 = {
    'objective': 'binary:logistic',  # Binary classification
    'num_round': 100,                # Number of boosting rounds
    'eval_metric': 'auc',            # Evaluation metric
    'learning_rate': 0.15,           # Learning rate
    'booster': 'gbtree'              # Booster algorithm
}

# Train the model
estimator_1 = train(
    model_output_path=f"s3://{bucket_name}/{model_output_prefix}/1",
    execution_role=role,
    sagemaker_session_obj=sagemaker_session,
    hyperparameters_dict=hyperparameters_v1,
    train_channel_loc=train_channel_location,
    val_channel_loc=validation_channel_location
)

# Log the training job and model artifact
log_training_job(experiment_key=experiment_1.get_key(), training_estimator=estimator_1)
log_model_to_comet(
    experiment=experiment_1,
    model_name="fraud-detection-xgb-v1",
    model_artifact_path=estimator_1.model_data,
    metadata=metadata
)

# Deploy and evaluate
deploy_and_evaluate_model(
    experiment=experiment_1,
    estimator=estimator_1,
    X_test_scaled=X_test_scaled,
    y_test=y_test
)

When running a Comet experiment from a Jupyter notebook, you need to end the experiment explicitly to make sure everything is captured and persisted to the Comet server:

experiment_1.end()

When the baseline experiment is complete, you can run additional experiments with different hyperparameters. Check out the notebook to see the details of both experiments.
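
For illustration only, a second configuration might lower the learning rate and increase the number of boosting rounds. The values below are assumptions, not the notebook's actual settings:

# Define hyperparameters for the second experiment (illustrative values)
hyperparameters_v2 = {
    'objective': 'binary:logistic',
    'num_round': 200,
    'eval_metric': 'auc',
    'learning_rate': 0.05,
    'max_depth': 6,
    'booster': 'gbtree'
}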

When the second experiment is complete, navigate to the Comet UI to compare these two experiment runs.

View Comet experiments in the UI

To access the UI, locate the URL in the SageMaker Studio IDE or execute the code provided in the notebook:

experiment_2.url

The following screenshot shows the Comet experiments UI. The experiment details are for illustration purposes only and do not represent a real-world fraud detection experiment.

This concludes the fraud detection experiment.

Clean up

For the experimentation part, SageMaker processing and training infrastructure is ephemeral and shuts down automatically when a job completes. However, you must still manually clean up a few resources to avoid unnecessary costs:

    Shut down the SageMaker JupyterLab Space after use. For instructions, refer to Idle shutdown.
    The Comet subscription renews based on the contract chosen. Cancel the contract when you no longer need the Comet subscription.
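
Also note that the real-time endpoint created during evaluation keeps accruing cost until it is deleted. Assuming the predictor object from the deployment step is still in scope, a minimal cleanup sketch:

# Delete the real-time endpoint and its endpoint configuration
predictor.delete_endpoint(delete_endpoint_config=True)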

Advantages of SageMaker and Comet integration

Having demonstrated the technical workflow, let’s examine the broader advantages this integration provides.

Streamlined model development

The Comet and SageMaker combination reduces the manual overhead of running ML experiments. While SageMaker handles infrastructure provisioning and scaling, Comet's automatic logging captures hyperparameters, metrics, code, installed libraries, and system performance from your training jobs without additional configuration. This helps teams focus on model development rather than experiment bookkeeping.

Comet's visualization capabilities extend beyond basic metric plots. Built-in charts enable rapid experiment comparison, and custom Python panels support domain-specific analysis tools for debugging model behavior, optimizing hyperparameters, or creating specialized visualizations that standard tools can't provide.

Enterprise collaboration and governance

For enterprise teams, the combination creates a mature platform for scaling ML projects across regulated environments. SageMaker provides consistent, secure ML environments, and Comet enables seamless collaboration with complete artifact and model lineage tracking. This helps avoid costly mistakes that occur when teams can’t recreate previous results.

Complete ML lifecycle integration

Unlike point solutions that only address training or monitoring, Comet paired with SageMaker supports your complete ML lifecycle. Models can be registered in Comet’s model registry with full version tracking and governance. SageMaker handles model deployment, and Comet maintains the lineage and approval workflows for model promotion. Comet’s production monitoring capabilities track model performance and data drift after deployment, creating a closed loop where production insights inform your next round of SageMaker experiments.
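
As one concrete example of the registry handoff, a model logged from an experiment can be promoted into Comet's model registry. The following is a minimal sketch, assuming a comet_ml version that exposes Experiment.register_model and reusing the model name from the earlier training run:

# Register the logged model in the Comet model registry for versioned governance
experiment.register_model("fraud-detection-xgb-v1")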

Conclusion

In this post, we showed how to use SageMaker and Comet together to spin up fully managed ML environments with reproducibility and experiment tracking capabilities.

To enhance your SageMaker workflows with comprehensive experiment management, deploy Comet directly in your SageMaker environment through the AWS Marketplace, and share your feedback in the comments.

For more information about the services and features discussed in this post, refer to the following resources:


About the authors

Vikesh Pandey is a Principal GenAI/ML Specialist Solutions Architect at AWS, helping large financial institutions adopt and scale generative AI and ML workloads. He is the author of the book Generative AI for Financial Services. He has more than 15 years of experience building enterprise-grade applications on generative AI/ML and related technologies. In his spare time, he plays an unnamed sport with his son that lies somewhere between football and rugby.

Naufal Mir is a Senior GenAI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy, and migrate machine learning workloads to SageMaker. He previously worked at financial services institutions developing and operating systems at scale. Outside of work, he enjoys ultra endurance running and cycling.

Sarah Ostermeier is a Technical Product Marketing Manager at Comet. She specializes in bringing Comet’s GenAI and ML developer products to the engineers who need them through technical content, educational resources, and product messaging. She has previously worked as an ML engineer, data scientist, and customer success manager, helping customers implement and scale AI solutions. Outside of work she enjoys traveling off the beaten path, writing about AI, and reading science fiction.
