AWS Machine Learning Blog, August 29
Filling gaps in air quality data with Amazon SageMaker Canvas

Africa faces a severe air pollution health crisis, and sensor network deployments there suffer from power and connectivity problems that leave large gaps in PM2.5 records. These gaps weaken statistical analysis and undermine decision-making. This post shows how Amazon SageMaker Canvas, a low-code machine learning platform, can use time-series forecasting to handle incomplete datasets and predict PM2.5 values. Orchestrated with AWS Lambda and AWS Step Functions, the solution handles missing data and maintains continuity of air quality monitoring, giving environmental analysts, public health officials, and business intelligence professionals reliable data for more accurate trend analysis, reporting, and decision-making.

💨 **Addressing missing data in African air quality monitoring**: In many parts of Africa, unstable power and connectivity cause widespread gaps in air quality sensor records, particularly for PM2.5. These gaps reduce the statistical power of analyses, introduce bias, and make trend detection and conclusions about air quality patterns unreliable, ultimately compromising decisions on pollution control strategies and health assessments.

🧠 **Time-series forecasting with Amazon SageMaker Canvas**: This post presents an approach that uses the low-code/no-code (LCNC) machine learning capabilities of Amazon SageMaker Canvas to predict PM2.5 values. A key strength of SageMaker Canvas is its robust handling of incomplete data: it can generate reliable predictions even when sensors fail or need maintenance, keeping air quality monitoring networks running continuously and avoiding data outages and their costs.

🛠️ **An end-to-end data imputation solution**: The solution combines Amazon SageMaker, AWS Lambda, and AWS Step Functions to give environmental analysts, public health officials, and business intelligence professionals a data imputation pipeline. It uses 15 million records from openAFRICA as the training dataset and fills missing PM2.5 values with time-series forecasting, with predictions falling within ±4.875 micrograms per cubic meter of actual values.

⚙️ **Solution architecture and workflow**: The solution has two main machine learning components, training and inference. SageMaker Canvas handles data preparation and model training; Amazon SageMaker Batch Transform performs batch inference; AWS Step Functions coordinates the workflow, including data retrieval, batch transformation, and database updates; and AWS Lambda functions carry out the individual operations, such as data format conversion, model invocation, and result updates. The entire pipeline runs automatically via Amazon EventBridge Scheduler and Step Functions, ensuring continuous, efficient data processing.

Air pollution remains one of Africa’s most pressing environmental health crises, causing widespread illness across the continent. Organizations like sensors.AFRICA have deployed hundreds of air quality sensors to address this challenge, but face a critical data problem: significant gaps in PM2.5 (particulate matter with diameter less than or equal to 2.5 micrometers) measurement records because of power instability and connectivity issues in high-risk regions where physical maintenance is limited. Missing data in PM2.5 datasets reduces statistical power and introduces bias into parameter estimates, leading to unreliable trend detection and flawed conclusions about air quality patterns. These data gaps ultimately compromise evidence-based decision-making for pollution control strategies, health impact assessments, and regulatory compliance.

In this post, we demonstrate the time-series forecasting capability of Amazon SageMaker Canvas, a low-code no-code (LCNC) machine learning (ML) platform, to predict PM2.5 from incomplete datasets. PM2.5 exposure contributes to millions of premature deaths globally through cardiovascular disease, respiratory illness, and systemic health effects, making accurate air quality forecasting a critical public health tool. A key advantage of the forecasting capability of SageMaker Canvas is its robust handling of incomplete data. Traditional air quality monitoring systems often require complete datasets to function properly, meaning they can’t be relied on when sensors malfunction or require maintenance. In contrast, SageMaker Canvas can generate reliable predictions even when faced with gaps in sensor data. This resilience enables continuous operation of air quality monitoring networks despite inevitable sensor failures or maintenance periods, eliminating costly downtime and data gaps. Environmental agencies and public health officials benefit from uninterrupted access to critical air quality information, enabling timely pollution alerts and more comprehensive long-term analysis of air quality trends. By maintaining operational continuity even with imperfect data inputs, SageMaker Canvas significantly enhances the reliability and practical utility of environmental monitoring systems.

In this post, we provide a data imputation solution using Amazon SageMaker AI, AWS Lambda, and AWS Step Functions. This solution is designed for environmental analysts, public health officials, and business intelligence professionals who need reliable PM2.5 data for trend analysis, reporting, and decision-making. We sourced our sample training dataset from openAFRICA. Our solution predicts PM2.5 values using time-series forecasting. The sample training dataset contained over 15 million records collected from March 2022 to October 2022 in various parts of Kenya and Nigeria, with data coming from 23 sensor devices at 15 unique locations. The sample code and workflows can be adapted to create prediction models for your PM2.5 datasets. See our solution’s README for detailed instructions.
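
To make these gaps concrete before modeling, you can profile a raw export with a few lines of pandas. The following is a minimal sketch; the file name, the 5-minute grid, and the column names (device_id, timestamp, value) are illustrative stand-ins for the fixed schema described in the repo's README.

import pandas as pd

# Load the raw sensor export (file and column names are illustrative;
# the required schema is described in the repo's README)
df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])

# Resample each device onto a regular 5-minute grid so that dropped
# readings surface as NaN values
series = (
    df.set_index("timestamp")
      .groupby("device_id")["value"]
      .resample("5min")
      .mean()
)

# Count missing PM2.5 intervals per device
print(series.isna().groupby(level="device_id").sum().sort_values(ascending=False))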

Solution overview

The solution consists of two main ML components: a training workflow and an inference workflow. These workflows are built using the following services:

    Amazon SageMaker Canvas for data preparation and model training
    Amazon SageMaker AI for batch transform inference
    AWS Lambda for data retrieval, format conversion, model invocation, and result updates
    AWS Step Functions for workflow orchestration
    Amazon EventBridge Scheduler and Amazon EventBridge for scheduling and event-driven coordination
    Amazon Aurora PostgreSQL-Compatible Edition for the air quality database
    Amazon Simple Storage Service (Amazon S3) for staging input and prediction files

At a high level, the solution works by taking a set of PM2.5 data with gaps and predicting the missing values within plus or minus 4.875 micrograms per cubic meter of the actual PM2.5 concentration. It does this by first training a model on the data with SageMaker Canvas, using inputs for the specific schema and a historical set of values from the user to guide the training process. After the model is trained on a representative dataset and schema, SageMaker Canvas exports the model for use with batch processing. The Step Functions orchestration calls a Lambda function every 24 hours that takes a dataset of new sensor data that has gaps and initiates a SageMaker batch transform job to predict the missing values. The batch transform job processes the entire dataset at once, and the Lambda function then updates the existing dataset with the results. The completed dataset with predicted values can then be distributed to public health decision-makers who need complete datasets to effectively analyze PM2.5 patterns.
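
To illustrate the inference trigger, here is a minimal sketch of a Lambda handler that starts a SageMaker batch transform job with boto3. The environment variable names, S3 prefixes, and instance type are hypothetical, not taken from the solution's repo.

import os
import boto3

sagemaker = boto3.client("sagemaker")

def handler(event, context):
    # S3 prefix holding the CSV of sensor rows with missing PM2.5 values,
    # produced by the upstream data-retrieval step (prefixes are illustrative)
    input_uri = f"s3://{os.environ['DATA_BUCKET']}/inference/input/"
    output_uri = f"s3://{os.environ['DATA_BUCKET']}/inference/output/"

    sagemaker.create_transform_job(
        TransformJobName=event["job_name"],  # must be unique per run
        ModelName=os.environ["MODEL_NAME"],  # model created from the Canvas model
        TransformInput={
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": input_uri,
            }},
            "ContentType": "text/csv",
            "SplitType": "Line",
        },
        TransformOutput={"S3OutputPath": output_uri, "AssembleWith": "Line"},
        TransformResources={"InstanceType": "ml.m5.large", "InstanceCount": 1},
    )
    return {"transform_job_name": event["job_name"]}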

We dive into each of these steps in later sections of this post.

Solution walkthrough

The following diagram shows our solution architecture:

Let’s explore the architecture step by step:

    1. To systematically collect, identify, and fill PM2.5 data gaps caused by sensor limitations and connectivity issues, Amazon EventBridge Scheduler invokes a Step Functions state machine every 24 hours. Step Functions orchestrates the various Lambda functions without you having to handle the complexities of error handling, retries, and state management, providing a serverless workflow that coordinates the PM2.5 data imputation process.
    2. The state machine invokes a Lambda function in your Amazon Virtual Private Cloud (Amazon VPC) that retrieves records containing missing air quality values from the user’s air quality database on Amazon Aurora PostgreSQL-Compatible Edition and stores them as a CSV file in an Amazon Simple Storage Service (Amazon S3) bucket.
    3. The state machine then runs a Lambda function that retrieves the records from Amazon S3 and initiates the SageMaker batch transform job in your VPC, using the SageMaker model created from your SageMaker Canvas predictive model trained on historical PM2.5 data.
    4. To streamline the batch transform workflow, the solution uses an event-driven approach with EventBridge and Step Functions. EventBridge captures completion events from SageMaker batch transform jobs, while the task token functionality of Step Functions enables waiting periods beyond the time limits of Lambda (see the sketch after this list).
    5. After processing completes, SageMaker writes the prediction results directly to an S3 bucket.
    6. The final step in the state machine retrieves the predicted values from the S3 bucket and updates the Aurora PostgreSQL-Compatible database with those values, including a predicted label set to true.
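
The task token pattern in step 4 can be sketched as follows, assuming the function that starts the batch transform job saves the state machine's task token in a DynamoDB table keyed by job name; the table and field names here are hypothetical.

import json
import boto3

sfn = boto3.client("stepfunctions")
dynamodb = boto3.resource("dynamodb")

def lookup_task_token(job_name):
    # Hypothetical token store: the Lambda function that started the job
    # saved the state machine's task token keyed by transform job name
    table = dynamodb.Table("transform-task-tokens")
    return table.get_item(Key={"job_name": job_name})["Item"]["task_token"]

def handler(event, context):
    # EventBridge delivers SageMaker Transform Job State Change events;
    # the detail block carries the job name and its final status
    detail = event["detail"]
    job_name = detail["TransformJobName"]

    if detail["TransformJobStatus"] == "Completed":
        sfn.send_task_success(
            taskToken=lookup_task_token(job_name),
            output=json.dumps({"job_name": job_name}),
        )
    else:
        sfn.send_task_failure(
            taskToken=lookup_task_token(job_name),
            error="BatchTransformFailed",
            cause=detail.get("FailureReason", "unknown"),
        )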

Prerequisites

To implement the PM2.5 data imputation solution, you must have the following:

    An AWS account with AWS Identity and Access Management (IAM) permissions sufficient to deploy the solution and interact with the database.
    A local desktop set up with AWS Command Line Interface (AWS CLI) version 2, Python 3.10, AWS Cloud Development Kit (AWS CDK) v2.x, and Git version 2.x.
    The AWS CLI configured with the necessary credentials in the desired AWS Region.
    Historical air quality sensor data. Note that our solution requires the fixed schema described in the GitHub repo’s README.

Deploy the solution

You will run the following steps to complete the deployment:

    1. Prepare your environment by building Python modules locally for Lambda layers, deploying infrastructure using the AWS CDK, and initializing your Aurora PostgreSQL database with sensor data.
    2. Perform the steps in the Build your air quality prediction model section to configure a SageMaker Canvas application, then train and register your model in Amazon SageMaker Model Registry.
    3. Create a SageMaker model from your registered SageMaker Canvas model by updating the infrastructure using the AWS CDK.
    4. Manage future configuration changes using the AWS CDK.

Step 1: Deploy AWS infrastructure and upload air quality sensor data

Complete the following steps to deploy the PM2.5 data imputation solution AWS infrastructure and upload air quality sensor data to your Aurora PostgreSQL-Compatible database:

    1. Clone the repository to your local desktop environment using the following command:

git clone git@github.com:aws-samples/sample-empowering-air-quality-research-secure-machine-learning-predictive-analytics.git

    2. Change to the project directory:

cd <BASE_PROJECT_FOLDER>

    3. Follow the deployment steps in the README file up to Model Setup for Batch Transform Inference.

Step 2: Build your air quality prediction model

After you create the SageMaker AI domain and the SageMaker AI user profile as part of the CDK deployment steps, follow these steps to build your air quality prediction model:

Configure your SageMaker Canvas application

    1. On the AWS Management Console, go to the SageMaker AI console and, under Admin configurations, Domains, select the domain and the user profile that were created.
    2. Choose the App Configurations tab, scroll down to the Canvas section, and select Edit.
    3. In Canvas storage configuration, select Encryption and choose aws/s3 from the dropdown.
    4. In ML Ops Configuration, turn on Enable Model Registry registration permissions for this user profile.
    5. Optionally, in the Local file upload configuration section of your domain’s Canvas App Configuration, turn on Enable local file upload.
    6. Choose Submit to save your configuration choices.
    7. On your Amazon SageMaker AI home page, go to the Applications and IDEs section and select Canvas.
    8. Select the SageMaker AI user profile that was created for you by the CDK deployment and choose Open Canvas. SageMaker Canvas starts creating your application in a new tab. This takes a few minutes.

Create and register your prediction model

In this phase, you develop a prediction model using your historical air quality sensor data.

The preceding architecture diagram illustrates the end-to-end process for training the SageMaker Canvas prediction model, registering that model, and creating a SageMaker model for running inference on newly found PM2.5 data gaps. The training process starts by extracting the air quality sensor dataset from the database. The dataset is imported into SageMaker Canvas for predictive analysis. The training dataset is then transformed and prepared through data wrangling steps in SageMaker Canvas for building and training ML models.
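
If you need to extract that training dataset yourself, a sketch like the following works under the assumption of a simple readings table; the host, credentials, table, column, and bucket names are placeholders, and the actual schema is defined in the repo's README.

import os
import boto3
import pandas as pd
import psycopg2

# Placeholder connection details; see the security section for IAM-based auth
conn = psycopg2.connect(
    host="pm25-cluster.cluster-xxxx.eu-west-1.rds.amazonaws.com",
    port=5432,
    user="analyst",
    password=os.environ["DB_PASSWORD"],
    dbname="airquality",
    sslmode="require",
)

# Export the historical readings that SageMaker Canvas will train on
df = pd.read_sql(
    "SELECT device_id, recorded_at, value FROM sensor_readings ORDER BY recorded_at",
    conn,
)
df.to_csv("training_dataset.csv", index=False)

# Stage the CSV in S3 so it can be imported into SageMaker Canvas
boto3.client("s3").upload_file(
    "training_dataset.csv", "YOUR_DATA_BUCKET", "canvas/training_dataset.csv"
)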

Prepare data

Our solution supports a SageMaker Canvas model trained for single-target variable prediction based on historical data and performs corresponding data imputation for PM2.5 data gaps. To train your model for predictive analysis, follow the comprehensive End to End Machine Learning workflow in the AWS Canvas Immersion Day workshop, adapting each step to prepare your air quality sensor dataset. Begin with the standard workflow until you reach the data preparation section, where you can customize the data wrangling steps for your dataset.

The following image shows our data wrangling Data Flow implemented using the preceding guidance:

Data Wrangler > Data Flow

The following image shows our data wrangling Analyses obtained by following the preceding guidance:

Data Wrangler > Analyses

Train your prediction model

After completing your data preparation, proceed to the Train the Model section of the workshop and continue with these specifications:

    Select problem type: Select Predictive Analysis as your ML approach. Because our dataset is tabular and contains a timestamp, a target column whose values we use to forecast future values, and a device ID column, SageMaker Canvas chooses time series forecasting.
    Define target column: Set Value as your target column for predicting PM2.5 values.
    Build configuration: Use the Standard Build option for model training because it generally achieves higher accuracy. See What happens when you build a model in How custom models work for more information.

By following these steps, you can create a model optimized for PM2.5 dataset predictive analysis, capable of generating valuable insights. Note that SageMaker Canvas supports retraining the ML model for updated PM2.5 datasets.

Evaluate the model

After training your model, proceed to Evaluate the model and review column impact, the root mean square error (RMSE) score, and other advanced metrics to understand your model’s performance for generating PM2.5 predictions.
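
If you want to sanity-check these metrics against a holdout slice of your own data, a short calculation like the following reproduces RMSE and R-squared; the CSV file and column names are illustrative, not produced by the solution.

import pandas as pd
import numpy as np

# Illustrative file with actual vs. predicted PM2.5 values for a holdout period
df = pd.read_csv("holdout_predictions.csv")
actual = df["pm25_actual"].to_numpy()
predicted = df["pm25_predicted"].to_numpy()

# Root mean square error of the predictions
rmse = np.sqrt(np.mean((actual - predicted) ** 2))

# Coefficient of determination (R-squared)
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.3f} µg/m³, R²: {r_squared:.3f}")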

The following image shows the model evaluation statistics we achieved.

Add the model to the registry

Once you are satisfied with your model’s performance, follow the steps in Register a model version to register it in the SageMaker AI model registry. Make sure to change the approval status to Approved before continuing to run this solution. At the time of this post’s publication, the approval status must be updated in Amazon SageMaker Studio.

Log out of SageMaker Canvas

After completing your work in SageMaker Canvas, you can log out or configure your application to automatically terminate the workspace instance. A workspace instance is dedicated for your use every time you launch a Canvas application, and you are billed for as long as the instance runs. Logging out or terminating the workspace instance stops the workspace instance billing. For more information, see billing and cost in SageMaker Canvas.

Step 3: Create a SageMaker model using your registered SageMaker Canvas model

In the previous steps, you created a SageMaker domain and user profile through the CDK deployment (Step 1) and registered your model (Step 2). Now it’s time to create the SageMaker model in your VPC using the SageMaker Canvas model you registered. Follow the Model Setup for Batch Inference and Re-Deploy with Updated Configuration sections in the code README to create the SageMaker model.
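
Under the hood, this step amounts to creating a SageMaker model from the registered model package. The following boto3 sketch shows the shape of that call; the ARNs, model name, and VPC identifiers are placeholders rather than values from the repo.

import boto3

sagemaker = boto3.client("sagemaker")

# Placeholder values -- use your registered (and approved) model package ARN,
# an execution role, and the private subnets/security groups of your VPC
model_package_arn = "arn:aws:sagemaker:REGION:ACCOUNT:model-package/canvas-pm25/1"

sagemaker.create_model(
    ModelName="pm25-imputation-model",
    Containers=[{"ModelPackageName": model_package_arn}],
    ExecutionRoleArn="arn:aws:iam::ACCOUNT:role/pm25-sagemaker-role",
    EnableNetworkIsolation=True,
    VpcConfig={
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0"],
    },
)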

Step 4: Manage future configuration changes

The same deployment pattern applies to any future configuration modifications you might require.

Update the relevant parameters in your configuration and run cdk deploy to propagate these changes throughout your solution architecture.

For a comprehensive list of configurable parameters and their default values, see the configuration file in the repository.

Run cdk deploy again to update your infrastructure stack with your model ID for batch transform operations, replacing the placeholder value initially deployed. This infrastructure-as-code approach helps ensure consistent, version-controlled updates to your data imputation workflow.
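
One way such a parameter can flow through the stack is via CDK context. The following minimal sketch, with hypothetical stack, construct, and parameter names rather than the repo's actual code, reads the model name at synthesis time and passes it to a Lambda function:

# app.py -- minimal CDK sketch (hypothetical stack and parameter names)
import aws_cdk as cdk
from aws_cdk import Stack, aws_lambda as _lambda
from constructs import Construct

class ImputationStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Read the SageMaker model name from CDK context, falling back to a
        # placeholder until the Canvas model has been registered and created
        model_name = self.node.try_get_context("sagemaker_model_name") or "PLACEHOLDER"

        _lambda.Function(
            self, "StartBatchTransformFn",
            runtime=_lambda.Runtime.PYTHON_3_10,
            handler="start_transform.handler",
            code=_lambda.Code.from_asset("lambda/start_transform"),
            environment={"MODEL_NAME": model_name},
        )

app = cdk.App()
ImputationStack(app, "Pm25ImputationStack")
app.synth()

With this arrangement, running cdk deploy -c sagemaker_model_name=<your-model-name> replaces the placeholder value.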

Security best practices

Security and compliance is a shared responsibility between AWS and the customer, as outlined in the Shared Responsibility Model. We encourage you to review this model for a comprehensive understanding of the respective responsibilities.

In this solution, we enhanced security by implementing encryption at rest for Amazon S3, Aurora PostgreSQL-Compatible database, and the SageMaker Canvas application. We also enabled encryption in transit by requiring SSL/TLS for all connections from the Lambda functions. We implemented secure database access by providing temporary dynamic credentials through IAM authentication for Amazon RDS, eliminating the need for static passwords. Each Lambda function operates with least privilege access, receiving only the minimal permissions required for its specific function. Finally, we deployed the Lambda functions, Aurora PostgreSQL-Compatible instance, and SageMaker Batch Transform jobs in private subnets of the VPC that do not traverse the public internet. This private network architecture is enabled through VPC endpoints for Amazon S3, SageMaker AI, and AWS Secrets Manager.
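
As an example of the temporary-credential pattern, a Lambda function can request a short-lived IAM authentication token instead of using a static password. This sketch uses placeholder host, Region, user, and database names; the solution's actual implementation may differ in detail.

import boto3
import psycopg2  # provided to the Lambda function via a layer

# Placeholder connection details for the Aurora PostgreSQL-Compatible cluster
DB_HOST = "pm25-cluster.cluster-xxxx.eu-west-1.rds.amazonaws.com"
DB_PORT = 5432
DB_USER = "imputation_lambda"

# Generate a short-lived IAM authentication token instead of a static password
rds = boto3.client("rds")
token = rds.generate_db_auth_token(
    DBHostname=DB_HOST, Port=DB_PORT, DBUsername=DB_USER, Region="eu-west-1"
)

# Require TLS so the connection is encrypted in transit
conn = psycopg2.connect(
    host=DB_HOST, port=DB_PORT, user=DB_USER, password=token,
    dbname="airquality", sslmode="require",
)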

Results

As shown in the following image, our model, developed using SageMaker Canvas, predicts PM2.5 values with an R-squared of 0.921. Because ML models for PM2.5 prediction frequently achieve R-squared values between 0.80 and 0.98 (see this example from ScienceDirect), our solution is within the range of higher-performing PM2.5 prediction models available today. SageMaker Canvas delivers this performance through its no-code experience, automatically handling model training and optimization without requiring ML expertise from users.

Clean up

Complete the following steps to clean up your resources:

    SageMaker Canvas application cleanup:
      1. Go to the SageMaker AI console and, under Admin configurations, Domains, select the domain that was created.
      2. Select the user created under User Profiles for that domain.
      3. On the User Details page, navigate to Spaces and Apps and choose Delete to manually delete your SageMaker Canvas application and clean up resources.
    SageMaker domain EFS storage cleanup:
      1. Open Amazon EFS and, in File systems, delete the file system tagged ManagedByAmazonSageMakerResource.
      2. Open the VPC console and, under Security, navigate to Security groups.
      3. Select security-group-for-inbound-nfs-<your-sagemaker-domain-id> and delete all inbound rules associated with that group.
      4. Select security-group-for-outbound-nfs-<your-sagemaker-domain-id> and delete all associated outbound rules.
      5. Finally, delete both security groups: security-group-for-inbound-nfs-<your-sagemaker-domain-id> and security-group-for-outbound-nfs-<your-sagemaker-domain-id>.
    Use the AWS CDK to clean up the remaining AWS resources:
      1. After the preceding steps are complete, return to your local desktop environment where the GitHub repo was cloned and change to the project’s infra directory:

      cd <BASE_PROJECT_FOLDER>/infra

      2. Destroy the resources created with AWS CloudFormation using the AWS CDK:

      cdk destroy

      3. Monitor the AWS CDK process as it deletes resources created by the solution. If there are any errors, troubleshoot using the CloudFormation console and then retry deletion.

Conclusion

The development of accurate PM2.5 prediction models has traditionally required extensive technical expertise, presenting significant challenges for public health researchers studying air pollution’s impact on disease outcomes. From data preprocessing and feature engineering to model selection and hyperparameter tuning, these technical requirements diverted substantial time and effort away from researchers’ core work of analyzing health outcomes and developing evidence-based interventions.

SageMaker Canvas transforms this landscape by dramatically reducing the effort required to develop high-performing PM2.5 prediction models. Public health researchers can now generate accurate predictions without mastering complex ML algorithms, iterate quickly through an intuitive interface, and validate models across regions without manual hyperparameter tuning. With this shift to streamlined, accessible prediction capabilities, researchers can dedicate more time to interpreting results, understanding air pollution’s impact on community health, and developing protective interventions for vulnerable populations. The result is more efficient research that responds quickly to emerging air quality challenges and informs timely public health decisions.

We invite you to implement this solution for your air quality research or ML-based predictive analytics projects. Our comprehensive deployment steps and customization guidance will help you launch quickly and efficiently. As we continue enhancing this solution, your feedback is invaluable for improving its capabilities and maximizing its impact.


About the authors

Nehal Sangoi is a Senior Technical Account Manager at Amazon Web Services. She provides strategic technical guidance to help independent software vendors plan and build solutions using AWS best practices. Connect with Nehal on LinkedIn.

Ben Peterson is a Senior Technical Account Manager with AWS. He is passionate about enhancing the developer experience and driving customer success. In his role, he provides strategic guidance on using the comprehensive AWS suite of services to modernize legacy systems, optimize performance, and unlock new capabilities. Connect with Ben on LinkedIn.

Shashank Shrivastava is a Senior Delivery Consultant and Serverless TFC member at AWS. He is passionate about helping customers and developers build modern applications on serverless architecture. As a pragmatic developer and blogger, he promotes community-driven learning and sharing of technology. His interests are software architecture, developer tools, GenAI, and serverless computing. Connect with Shashank on LinkedIn.

Akshay Singhal is a Senior Technical Account Manager at Amazon Web Services supporting Enterprise Support customers focusing on the Security ISV segment. He provides technical guidance for customers to implement AWS solutions, with expertise spanning serverless architectures and cost optimization. Outside of work, Akshay enjoys traveling, Formula 1, making short movies, and exploring new cuisines. Connect with Akshay on LinkedIn.
