AWS Machine Learning Blog | September 3
SageMaker HyperPod Simplifies AI Model Training and Deployment

 

Amazon SageMaker HyperPod has introduced a new command line interface (CLI) and software development kit (SDK) designed to simplify how data scientists and machine learning (ML) practitioners manage distributed AI model training and inference. The CLI offers intuitive command-line operations that abstract away the complexity of distributed systems, making quick experimentation easy. The SDK provides a Python interface for advanced users who need fine-grained control over custom ML workflows. The article demonstrates how to use the CLI and SDK to train and deploy large language models (LLMs) on SageMaker HyperPod, including distributed training with Fully Sharded Data Parallel (FSDP) and model deployment for inference, showing how these tools accelerate the development of production-grade generative AI applications.

🚀 **SageMaker HyperPod CLI simplifies operations**: The newly released CLI gives data scientists an intuitive command-line interface for managing SageMaker HyperPod's distributed training and inference capabilities. It abstracts the complexity of the underlying distributed systems, making common workflows such as launching training and fine-tuning jobs, deploying inference endpoints, and monitoring cluster performance straightforward, which is ideal for quick experimentation and iteration.

🐍 **SageMaker HyperPod SDK adds flexibility**: For advanced users who need finer-grained control and custom ML workflows, the SageMaker HyperPod SDK provides a powerful Python programming interface. Developers can use the SDK to configure training and deployment parameters precisely while retaining the convenience of working with familiar Python objects, covering more complex use cases.

🧠 **LLM training and deployment examples**: Through hands-on examples, the article shows how to use the SageMaker HyperPod CLI and SDK to train and deploy large language models (LLMs), including distributed training with Fully Sharded Data Parallel (FSDP) and model deployment for inference, highlighting how these tools accelerate the development of production-grade generative AI applications.

🛠️ **Core components and prerequisites**: Using SageMaker HyperPod for model training and deployment requires a set of prerequisites, including AWS account access, a configured local environment (Linux or macOS, Python 3.8-3.11, AWS CLI), a SageMaker HyperPod cluster (based on Amazon EKS and configured with specific instance groups), and an FSx for Lustre persistent volume claim. In addition, the HyperPod training and inference operators must be installed.

Training and deploying large AI models requires advanced distributed computing capabilities, but managing these distributed systems shouldn’t be complex for data scientists and machine learning (ML) practitioners. The newly released command line interface (CLI) and software development kit (SDK) for Amazon SageMaker HyperPod simplify how you can use the service’s distributed training and inference capabilities.

The SageMaker HyperPod CLI provides data scientists with an intuitive command-line experience, abstracting away the underlying complexity of distributed systems. Built on top of the SageMaker HyperPod SDK, the CLI offers straightforward commands for common workflows like launching training or fine-tuning jobs, deploying inference endpoints, and monitoring cluster performance. This makes it ideal for quick experimentation and iteration.

For more advanced use cases requiring fine-grained control, the SageMaker HyperPod SDK enables programmatic access to customize your ML workflows. Developers can use the SDK’s Python interface to precisely configure training and deployment parameters while maintaining the simplicity of working with familiar Python objects.

In this post, we demonstrate how to use both the CLI and SDK to train and deploy large language models (LLMs) on SageMaker HyperPod. We walk through practical examples of distributed training using Fully Sharded Data Parallel (FSDP) and model deployment for inference, showcasing how these tools streamline the development of production-ready generative AI applications.

Prerequisites

To follow the examples in this post, you must have the following prerequisites:

Because the use cases that we demonstrate are about training and deploying LLMs with the SageMaker HyperPod CLI and SDK, you must also install the following Kubernetes operators in the cluster:

Install the SageMaker HyperPod CLI

First, you must install the latest version of the SageMaker HyperPod CLI and SDK (the examples in this post are based on version 3.1.0). From the local environment, run the following command (you can also install in a Python virtual environment):

# Install the HyperPod CLI and SDK
pip install sagemaker-hyperpod

This command sets up the tools needed to interact with SageMaker HyperPod clusters. For an existing installation, make sure you have the latest version of the package (sagemaker-hyperpod>=3.1.0) so that the full set of features is available. To verify that the CLI is installed correctly, run the hyp command and check the output:

# Check if the HyperPod CLI is correctly installed
hyp

The output will be similar to the following, and includes instructions on how to use the CLI:

Usage: hyp [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  create               Create endpoints or pytorch jobs.
  delete               Delete endpoints or pytorch jobs.
  describe             Describe endpoints or pytorch jobs.
  get-cluster-context  Get context related to the current set cluster.
  get-logs             Get pod logs for endpoints or pytorch jobs.
  get-monitoring       Get monitoring configurations for Hyperpod cluster.
  get-operator-logs    Get operator logs for endpoints.
  invoke               Invoke model endpoints.
  list                 List endpoints or pytorch jobs.
  list-cluster         List SageMaker Hyperpod Clusters with metadata.
  list-pods            List pods for endpoints or pytorch jobs.
  set-cluster-context  Connect to a HyperPod EKS cluster.

For more information on CLI usage and the available commands and respective parameters, refer to the CLI reference documentation.

Set the cluster context

The SageMaker HyperPod CLI and SDK use the Kubernetes API to interact with the cluster. Therefore, make sure the underlying Kubernetes Python client is configured to execute API calls against your cluster by setting the cluster context.

Use the CLI to list the clusters available in your AWS account:

# List all HyperPod clusters in your AWS account
hyp list-cluster

[
    {
        "Cluster": "ml-cluster",
        "Instances": [
            {
                "InstanceType": "ml.g5.8xlarge",
                "TotalNodes": 8,
                "AcceleratorDevicesAvailable": 8,
                "NodeHealthStatus=Schedulable": 8,
                "DeepHealthCheckStatus=Passed": "N/A"
            },
            {
                "InstanceType": "ml.m5.12xlarge",
                "TotalNodes": 1,
                "AcceleratorDevicesAvailable": "N/A",
                "NodeHealthStatus=Schedulable": 1,
                "DeepHealthCheckStatus=Passed": "N/A"
            }
        ]
    }
]

Set the cluster context specifying the cluster name as input (in our case, we use ml-cluster as <cluster_name>):

# Set the cluster context for subsequent commands
hyp set-cluster-context --cluster-name <cluster_name>
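
If the CLI updates your local kubeconfig and you have kubectl installed, you can optionally confirm which Kubernetes context is now active. This is a minimal sanity check, not part of the required setup:

# Optional: confirm the active Kubernetes context (requires a local kubectl)
kubectl config current-context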

Train models with the SageMaker HyperPod CLI and SDK

The SageMaker HyperPod CLI provides a straightforward way to submit PyTorch model training and fine-tuning jobs to a SageMaker HyperPod cluster. In the following example, we schedule a Meta Llama 3.1 8B model training job with FSDP.

The CLI executes training using the HyperPodPyTorchJob Kubernetes custom resource, which is implemented by the HyperPod training operator that must be installed in the cluster, as discussed in the prerequisites section.
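
As a quick check that the training operator's custom resources are registered in your cluster (a minimal sketch, assuming kubectl is configured against the cluster), you can look for the corresponding custom resource definitions:

# Check that HyperPod custom resource definitions exist in the cluster
kubectl get crd | grep -i hyperpod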

First, clone the awsome-distributed-training repository and create the Docker image that you will use for the training job:

cd ~
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/pytorch/FSDP

Then, log in to the Amazon Elastic Container Registry (Amazon ECR) to pull the base image and build the new container:

export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/

docker build -f Dockerfile -t ${REGISTRY}fsdp:pytorch2.7.1 .

The Dockerfile in the awsome-distributed-training repository referenced in the preceding code already contains the HyperPod elastic agent, which orchestrates lifecycles of training workers on each container and communicates with the HyperPod training operator. If you’re using a different Dockerfile, install the HyperPod elastic agent following the instructions in HyperPod elastic agent.

Next, create a new registry for your training image if needed and push the built image to it:

# Create registry if needed
REGISTRY_COUNT=$(aws ecr describe-repositories | grep "fsdp" | wc -l)
if [ "$REGISTRY_COUNT" -eq 0 ]; then
    aws ecr create-repository --repository-name fsdp
fi

# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Push image to registry
docker image push ${REGISTRY}fsdp:pytorch2.7.1

After you have successfully created the Docker image, you can submit the training job using the SageMaker HyperPod CLI.

Internally, the SageMaker HyperPod CLI uses the Kubernetes Python client to build a HyperPodPyTorchJob custom resource and then creates it on the Kubernetes cluster.

You can adapt the CLI command to other Meta Llama configurations by changing the --args values as needed; examples can be found in the Kubernetes manifests in the awsome-distributed-training repository.

In the given configuration, the training job will write checkpoints to /fsx/checkpoints on the FSx for Lustre PVC.

hyp create hyp-pytorch-job \
    --job-name fsdp-llama3-1-8b \
    --image ${REGISTRY}fsdp:pytorch2.7.1 \
    --command '[
        hyperpodrun,
        --tee=3,
        --log_dir=/tmp/hyperpod,
        --nproc_per_node=1,
        --nnodes=8,
        /fsdp/train.py
    ]' \
    --args '[
        --max_context_width=8192,
        --num_key_value_heads=8,
        --intermediate_size=14336,
        --hidden_width=4096,
        --num_layers=32,
        --num_heads=32,
        --model_type=llama_v3,
        --tokenizer=hf-internal-testing/llama-tokenizer,
        --checkpoint_freq=50,
        --validation_freq=25,
        --max_steps=50,
        --checkpoint_dir=/fsx/checkpoints,
        --dataset=allenai/c4,
        --dataset_config_name=en,
        --resume_from_checkpoint=/fsx/checkpoints,
        --train_batch_size=1,
        --val_batch_size=1,
        --sharding_strategy=full,
        --offload_activations=1
    ]' \
    --environment '{"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"}' \
    --pull-policy "IfNotPresent" \
    --instance-type ml.g5.8xlarge \
    --node-count 8 \
    --tasks-per-node 1 \
    --deep-health-check-passed-nodes-only false \
    --max-retry 3 \
    --volume name=shmem,type=hostPath,mount_path=/dev/shm,path=/dev/shm,read_only=false \
    --volume name=fsx,type=pvc,mount_path=/fsx,claim_name=fsx-claim,read_only=false

The hyp create hyp-pytorch-job command supports additional arguments, which can be discovered by running the following:

hyp create hyp-pytorch-job --help

The preceding example code contains the following relevant arguments:

If successful, the command will output the following:

Using version: 1.0
2025-08-12 10:03:03,270 - sagemaker.hyperpod.training.hyperpod_pytorch_job - INFO - Successfully submitted HyperPodPytorchJob 'fsdp-llama3-1-8b'!

You can observe the status of the training job through the CLI. Running hyp list hyp-pytorch-job will show the status first as Created and then as Running after the containers have been started:
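
hyp list hyp-pytorch-job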

NAME                          NAMESPACE           STATUS         AGE
--------------------------------------------------------------------------------
fsdp-llama3-1-8b              default             Running        6m

To list the pods that are created by this training job, run the following command:

hyp list-pods hyp-pytorch-job --job-name fsdp-llama3-1-8b

Pods for job: fsdp-llama3-1-8b
POD NAME                                          NAMESPACE
----------------------------------------------------------------------
fsdp-llama3-1-8b-pod-0                            default
fsdp-llama3-1-8b-pod-1                            default
fsdp-llama3-1-8b-pod-2                            default
fsdp-llama3-1-8b-pod-3                            default
fsdp-llama3-1-8b-pod-4                            default
fsdp-llama3-1-8b-pod-5                            default
fsdp-llama3-1-8b-pod-6                            default
fsdp-llama3-1-8b-pod-7                            default

You can observe the logs of one of the training pods that get spawned by running the following command:

hyp get-logs hyp-pytorch-job --pod-name fsdp-llama3-1-8b-pod-0 \
    --job-name fsdp-llama3-1-8b

...
2025-08-12T14:59:25.069208138Z [HyperPodElasticAgent] 2025-08-12 14:59:25,069 [INFO] [rank0-restart0] /usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py:685: [default] Starting worker group
2025-08-12T14:59:25.069301320Z [HyperPodElasticAgent] 2025-08-12 14:59:25,069 [INFO] [rank0-restart0] /usr/local/lib/python3.10/dist-packages/hyperpod_elastic_agent/hyperpod_elastic_agent.py:221: Starting workers with worker spec worker_group.spec=WorkerSpec(role='default', local_world_size=1, rdzv_handler=<hyperpod_elastic_agent.rendezvous.hyperpod_rendezvous_backend.HyperPodRendezvousBackend object at 0x7f0970a4dc30>, fn=None, entrypoint='/usr/bin/python3', args=('-u', '/fsdp/train.py', '--max_context_width=8192', '--num_key_value_heads=8', '--intermediate_size=14336', '--hidden_width=4096', '--num_layers=32', '--num_heads=32', '--model_type=llama_v3', '--tokenizer=hf-internal-testing/llama-tokenizer', '--checkpoint_freq=50', '--validation_freq=50', '--max_steps=100', '--checkpoint_dir=/fsx/checkpoints', '--dataset=allenai/c4', '--dataset_config_name=en', '--resume_from_checkpoint=/fsx/checkpoints', '--train_batch_size=1', '--val_batch_size=1', '--sharding_strategy=full', '--offload_activations=1'), max_restarts=3, monitor_interval=0.1, master_port=None, master_addr=None, local_addr=None)
...
2025-08-12T14:59:30.264195963Z [default0]:2025-08-12 14:59:29,968 [INFO] __main__: Creating Model
2025-08-12T15:00:51.203541576Z [default0]:2025-08-12 15:00:50,781 [INFO] __main__: Created model with total parameters: 7392727040 (7.39 B)
2025-08-12T15:01:18.139531830Z [default0]:2025-08-12 15:01:18 I [checkpoint.py:79] Loading checkpoint from /fsx/checkpoints/llama_v3-24steps ...
2025-08-12T15:01:18.833252603Z [default0]:2025-08-12 15:01:18,081 [INFO] __main__: Wrapped model with FSDP
2025-08-12T15:01:18.833290793Z [default0]:2025-08-12 15:01:18,093 [INFO] __main__: Created optimizer

We elaborate on more advanced debugging and observability features at the end of this section.

Alternatively, if you prefer a programmatic experience and more advanced customization options, you can submit the training job using the SageMaker HyperPod Python SDK. For more information, refer to the SDK reference documentation. The following code will yield the equivalent training job submission to the preceding CLI example:

import os

from sagemaker.hyperpod.training import HyperPodPytorchJob
from sagemaker.hyperpod.training import ReplicaSpec, Template, VolumeMounts, Spec, Containers, Resources, RunPolicy, Volumes, HostPath, PersistentVolumeClaim
from sagemaker.hyperpod.common.config import Metadata

REGISTRY = os.environ['REGISTRY']

# Define job specifications
nproc_per_node = "1"  # Number of processes per node

replica_specs = [
    ReplicaSpec(
        name="pod",  # Replica name
        replicas=8,
        template=Template(
            spec=Spec(
                containers=[
                    Containers(
                        # Container name
                        name="fsdp-training-container",
                        # Training image
                        image=f"{REGISTRY}fsdp:pytorch2.7.1",
                        # Volume mounts
                        volume_mounts=[
                            VolumeMounts(name="fsx", mount_path="/fsx"),
                            VolumeMounts(name="shmem", mount_path="/dev/shm")
                        ],
                        env=[
                            {"name": "PYTORCH_CUDA_ALLOC_CONF", "value": "max_split_size_mb:32"},
                        ],
                        # Image pull policy
                        image_pull_policy="IfNotPresent",
                        resources=Resources(
                            requests={"nvidia.com/gpu": "1"},
                            limits={"nvidia.com/gpu": "1"},
                        ),
                        # Command to run
                        command=[
                            "hyperpodrun",
                            "--tee=3",
                            "--log_dir=/tmp/hyperpod",
                            "--nproc_per_node=1",
                            "--nnodes=8",
                            "/fsdp/train.py"
                        ],
                        # Script arguments
                        args=[
                            '--max_context_width=8192',
                            '--num_key_value_heads=8',
                            '--intermediate_size=14336',
                            '--hidden_width=4096',
                            '--num_layers=32',
                            '--num_heads=32',
                            '--model_type=llama_v3',
                            '--tokenizer=hf-internal-testing/llama-tokenizer',
                            '--checkpoint_freq=50',
                            '--validation_freq=25',
                            '--max_steps=50',
                            '--checkpoint_dir=/fsx/checkpoints',
                            '--dataset=allenai/c4',
                            '--dataset_config_name=en',
                            '--resume_from_checkpoint=/fsx/checkpoints',
                            '--train_batch_size=1',
                            '--val_batch_size=1',
                            '--sharding_strategy=full',
                            '--offload_activations=1'
                        ]
                    )
                ],
                volumes=[
                    Volumes(
                        name="fsx",
                        persistent_volume_claim=PersistentVolumeClaim(
                            claim_name="fsx-claim",
                            read_only=False
                        ),
                    ),
                    Volumes(
                        name="shmem",
                        host_path=HostPath(path="/dev/shm"),
                    )
                ],
                node_selector={
                    "node.kubernetes.io/instance-type": "ml.g5.8xlarge",
                },
            )
        ),
    )
]

run_policy = RunPolicy(clean_pod_policy="None", job_max_retry_count=3)

# Create and start the PyTorch job
pytorch_job = HyperPodPytorchJob(
    # Job name
    metadata=Metadata(
        name="fsdp-llama3-1-8b",
        namespace="default",
    ),
    # Processes per node
    nproc_per_node=nproc_per_node,
    # Replica specifications
    replica_specs=replica_specs,
)

# Launch the job
pytorch_job.create()

Debugging training jobs

In addition to monitoring the training pod logs as described earlier, there are several other useful ways of debugging training jobs. For example, you can inspect the underlying Kubernetes resources directly, as sketched below.
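
Because each training job is represented by a HyperPodPyTorchJob custom resource and a set of pods on the EKS cluster, standard Kubernetes tooling also works. The following is a minimal sketch, assuming kubectl is configured against the cluster and using the job name from the preceding example (the exact custom resource name exposed by kubectl may vary with the operator version):

# Inspect the custom resource, a worker pod, and recent cluster events
kubectl describe hyperpodpytorchjob fsdp-llama3-1-8b
kubectl describe pod fsdp-llama3-1-8b-pod-0
kubectl get events --sort-by=.lastTimestamp | tail -n 20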

Deploy models with the SageMaker HyperPod CLI and SDK

The SageMaker HyperPod CLI provides commands to quickly deploy models to your SageMaker HyperPod cluster for inference. You can deploy both foundation models (FMs) available on Amazon SageMaker JumpStart as well as custom models with artifacts that are stored on Amazon S3 or FSx for Lustre file systems.

This functionality automatically deploys the chosen model to the SageMaker HyperPod cluster through Kubernetes custom resources, which are implemented by the HyperPod inference operator that must be installed in the cluster, as discussed in the prerequisites section. Optionally, it can also create a SageMaker inference endpoint and an Application Load Balancer (ALB), which can be invoked directly over HTTPS using a generated TLS certificate.

Deploy SageMaker JumpStart models

You can deploy an FM that is available on SageMaker JumpStart with the following command:

hyp create hyp-jumpstart-endpoint \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --instance-type ml.g5.8xlarge \
  --endpoint-name deepseek-distill-qwen-endpoint-cli \
  --tls-certificate-output-s3-uri s3://<certificate-bucket>/ \
  --namespace default

The preceding code includes the following parameters:

The CLI supports more advanced deployment configurations, including auto scaling, through additional parameters, which can be viewed by running the following command:

hyp create hyp-jumpstart-endpoint --help

If successful, the command will output the following:

Creating JumpStart model and sagemaker endpoint. Endpoint name: deepseek-distill-qwen-endpoint-cli. The process may take a few minutes...

After a few minutes, both the ALB and the SageMaker inference endpoint will be available, which can be observed through the CLI. Running hyp list hyp-jumpstart-endpoint will show the status first as DeploymentInProgress and then as DeploymentComplete when the endpoint is ready to be used:
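
hyp list hyp-jumpstart-endpoint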

| name                               | namespace   | labels   | status             |
|------------------------------------|-------------|----------|--------------------|
| deepseek-distill-qwen-endpoint-cli | default     |          | DeploymentComplete |

To get additional visibility into the deployment pod, run the following commands to find the pod name and view the corresponding logs:

hyp list-pods hyp-jumpstart-endpoint --namespace <namespace>
hyp get-logs hyp-jumpstart-endpoint --namespace <namespace> --pod-name <model-pod-name>

The output will look similar to the following:

2025-08-12T15:53:14.042031963Z WARN  PyProcess W-195-model-stderr: Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:18<00:00,  1.63it/s]
2025-08-12T15:53:14.042257357Z WARN  PyProcess W-195-model-stderr: Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:18<00:00,  1.94it/s]
2025-08-12T15:53:14.042297298Z INFO  PyProcess W-195-model-stdout: INFO 08-12 15:53:14 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 26.18 seconds
2025-08-12T15:53:15.215357997Z INFO  PyProcess Model [model] initialized.
2025-08-12T15:53:15.219205375Z INFO  WorkerThread Starting worker thread WT-0001 for model model (M-0001, READY) on device gpu(0)
2025-08-12T15:53:15.221591827Z INFO  ModelServer Initialize BOTH server with: EpollServerSocketChannel.
2025-08-12T15:53:15.231404670Z INFO  ModelServer BOTH API bind to: http://0.0.0.0:8080

You can invoke the SageMaker inference endpoint you created through the CLI by running the following command:

hyp invoke hyp-jumpstart-endpoint \
    --endpoint-name deepseek-distill-qwen-endpoint-cli \
    --body '{"inputs":"What is the capital of USA?"}'

You will get an output similar to the following:

{"generated_text": " What is the capital of France? What is the capital of Japan? What is the capital of China? What is the capital of Germany? What is"}

Alternatively, if you prefer a programmatic experience and advanced customization options, you can use the SageMaker HyperPod Python SDK. The following code will yield the equivalent deployment to the preceding CLI example:

from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint

model = Model(
    model_id='deepseek-llm-r1-distill-qwen-1-5b',
)

server = Server(
    instance_type='ml.g5.8xlarge',
)

endpoint_name = SageMakerEndpoint(name='deepseek-distill-qwen-endpoint-cli')

tls_config = TlsConfig(tls_certificate_output_s3_uri='s3://<certificate-bucket>')

js_endpoint = HPJumpStartEndpoint(
    model=model,
    server=server,
    sage_maker_endpoint=endpoint_name,
    tls_config=tls_config,
    namespace="default"
)

js_endpoint.create()

Deploy custom models

You can also use the CLI to deploy custom models with model artifacts stored on either Amazon S3 or FSx for Lustre. This is useful for models that have been fine-tuned on custom data. You must provide the storage location of the model artifacts as well as a container image for inference that is compatible with the model artifacts and SageMaker inference endpoints. In the following example, we deploy a TinyLlama 1.1B model from Amazon S3 using the DJL Large Model Inference container image.

In preparation, download the model artifacts locally and push them to an S3 bucket:

# Install huggingface-hub if not present on your machine
pip install huggingface-hub

# Download model
hf download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir ./tinyllama-1.1b-chat

# Upload to S3
aws s3 cp ./tinyllama-1.1b-chat s3://<model-bucket>/models/tinyllama-1.1b-chat/ --recursive

Now you can deploy the model with the following command:

hyp create hyp-custom-endpoint \
    --endpoint-name my-custom-tinyllama-endpoint \
    --model-name tinyllama \
    --model-source-type s3 \
    --model-location models/tinyllama-1.1b-chat/ \
    --s3-bucket-name <model-bucket> \
    --s3-region <model-bucket-region> \
    --instance-type ml.g5.8xlarge \
    --image-uri 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128 \
    --container-port 8080 \
    --model-volume-mount-name modelmount \
    --tls-certificate-output-s3-uri s3://<certificate-bucket>/ \
    --namespace default

The preceding code contains the following key parameters:

Similar to SageMaker JumpStart model deployment, the CLI supports more advanced deployment configurations, including auto scaling, through additional parameters, which you can view by running the following command:

hyp create hyp-custom-endpoint --help

If successful, the command will output the following:

Creating sagemaker model and endpoint. Endpoint name: my-custom-tinyllama-endpoint. The process may take a few minutes...

After a few minutes, both the ALB and the SageMaker inference endpoint will be available, which you can observe through the CLI. Running hyp list hyp-custom-endpoint will show the status first as DeploymentInProgress and then as DeploymentComplete when the endpoint is ready to be used:
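
hyp list hyp-custom-endpoint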

| name                         | namespace   | labels   | status             |
|------------------------------|-------------|----------|--------------------|
| my-custom-tinyllama-endpoint | default     |          | DeploymentComplete |

To get additional visibility into the deployment pod, run the following commands to find the pod name and view the corresponding logs:

hyp list-pods hyp-custom-endpoint --namespace <namespace>
hyp get-logs hyp-custom-endpoint --namespace <namespace> --pod-name <model-pod-name>

The output will look similar to the following:

INFO  PyProcess W-196-model-stdout: INFO 08-12 16:00:36 [monitor.py:33] torch.compile takes 29.18 s in total
INFO  PyProcess W-196-model-stdout: INFO 08-12 16:00:37 [kv_cache_utils.py:634] GPU KV cache size: 809,792 tokens
INFO  PyProcess W-196-model-stdout: INFO 08-12 16:00:37 [kv_cache_utils.py:637] Maximum concurrency for 2,048 tokens per request: 395.41x
INFO  PyProcess W-196-model-stdout: INFO 08-12 16:00:59 [gpu_model_runner.py:1626] Graph capturing finished in 22 secs, took 0.37 GiB
INFO  PyProcess W-196-model-stdout: INFO 08-12 16:00:59 [core.py:163] init engine (profile, create kv cache, warmup model) took 59.39 seconds
INFO  PyProcess W-196-model-stdout: INFO 08-12 16:00:59 [core_client.py:435] Core engine process 0 ready.
INFO  PyProcess Model [model] initialized.
INFO  WorkerThread Starting worker thread WT-0001 for model model (M-0001, READY) on device gpu(0)
INFO  ModelServer Initialize BOTH server with: EpollServerSocketChannel.
INFO  ModelServer BOTH API bind to: http://0.0.0.0:8080

You can invoke the SageMaker inference endpoint you created through the CLI by running the following command:

hyp invoke hyp-custom-endpoint \
    --endpoint-name my-custom-tinyllama-endpoint \
    --body '{"inputs":"What is the capital of USA?"}'

You will get an output similar to the following:

{"generated_text": " What is the capital of France? What is the capital of Japan? What is the capital of China? What is the capital of Germany? What is"}

Alternatively, you can deploy using the SageMaker HyperPod Python SDK. The following code will yield the equivalent deployment to the preceding CLI example:

from sagemaker.hyperpod.inference.config.hp_endpoint_config import S3Storage, ModelSourceConfig, TlsConfig, EnvironmentVariables, ModelInvocationPort, ModelVolumeMount, Resources, Worker
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

model_source_config = ModelSourceConfig(
    model_source_type='s3',
    model_location="models/tinyllama-1.1b-chat/",
    s3_storage=S3Storage(
        bucket_name='<model-bucket>',
        region='<model-bucket-region>',
    ),
)

worker = Worker(
    image='763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128',
    model_volume_mount=ModelVolumeMount(
        name='modelmount',
    ),
    model_invocation_port=ModelInvocationPort(container_port=8080),
    resources=Resources(
        requests={"cpu": "30000m", "nvidia.com/gpu": 1, "memory": "100Gi"},
        limits={"nvidia.com/gpu": 1}
    ),
)

tls_config = TlsConfig(tls_certificate_output_s3_uri='s3://<certificate-bucket>/')

custom_endpoint = HPEndpoint(
    endpoint_name='my-custom-tinyllama-endpoint',
    instance_type='ml.g5.8xlarge',
    model_name='tinyllama',
    tls_config=tls_config,
    model_source_config=model_source_config,
    worker=worker,
)

custom_endpoint.create()

Debugging inference deployments

In addition to monitoring the inference pod logs, there are several other useful ways of debugging inference deployments. For example, you can inspect the logs of the HyperPod inference operator itself, as sketched below.
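
The hyp CLI includes a get-operator-logs command (visible in the help output shown earlier) for retrieving inference operator logs for an endpoint. Its exact parameters may vary by CLI version, so a safe starting point is to inspect its help text:

# Discover the parameters supported by the operator-logs command
hyp get-operator-logs --help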

For more information on the inference deployment features in SageMaker HyperPod, see Amazon SageMaker HyperPod launches model deployments to accelerate the generative AI model development lifecycle and Deploying models on Amazon SageMaker HyperPod.

Clean up

To delete the training job from the corresponding example, use the following CLI command:

hyp delete hyp-pytorch-job --job-name fsdp-llama3-1-8b

To delete the model deployments from the inference example, use the following CLI commands for SageMaker JumpStart and custom model deployments, respectively:

hyp delete hyp-jumpstart-endpoint --name deepseek-distill-qwen-endpoint-cli
hyp delete hyp-custom-endpoint --name my-custom-tinyllama-endpoint

To avoid incurring ongoing costs for the instances running in your cluster, you can scale down the instances or delete them.
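
If the cluster itself is no longer needed, you can also delete it entirely. The following is a minimal sketch using the AWS CLI and the example cluster name from this post; deleting a cluster removes all of its instance groups:

# Deletes the whole HyperPod cluster; substitute your own cluster name
aws sagemaker delete-cluster --cluster-name ml-cluster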

Conclusion

The new SageMaker HyperPod CLI and SDK can significantly streamline the process of training and deploying large-scale AI models. Through the examples in this post, we’ve demonstrated how these tools provide the following benefits:

Getting started with these tools is as simple as installing the sagemaker-hyperpod package. The SageMaker HyperPod CLI and SDK provide the right level of abstraction for both data scientists looking to quickly experiment with distributed training and ML engineers building production systems.

For more information about SageMaker HyperPod and these development tools, refer to the SageMaker HyperPod CLI and SDK documentation or explore the example notebooks.


About the authors

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Shweta Singh is a Senior Product Manager in the Amazon SageMaker Machine Learning platform team at AWS, leading the SageMaker Python SDK. She has worked in several product roles in Amazon for over 5 years. She has a Bachelor of Science degree in Computer Engineering and a Masters of Science in Financial Engineering, both from New York University.

Nicolas Jourdan is a Specialist Solutions Architect at AWS, where he helps customers unlock the full potential of AI and ML in the cloud. He holds a PhD in Engineering from TU Darmstadt in Germany, where his research focused on the reliability, concept drift detection, and MLOps of industrial ML applications. Nicolas has extensive hands-on experience across industries, including autonomous driving, drones, and manufacturing, having worked in roles ranging from research scientist to engineering manager. He has contributed to award-winning research, holds patents in object detection and anomaly detection, and is passionate about applying cutting-edge AI to solve complex real-world problems.
