AWS Machine Learning Blog, September 9

Fine-grained compute and memory quota allocation

We are excited to announce the general availability of fine-grained compute and memory quota allocation with HyperPod task governance. With this capability, customers can optimize Amazon SageMaker HyperPod cluster utilization on Amazon Elastic Kubernetes Service (Amazon EKS), distribute fair usage, and support efficient resource allocation across different teams or projects. For more information, see HyperPod task governance best practices for maximizing the value of SageMaker HyperPod task governance.

Compute quota management is an administrative mechanism that sets and controls compute resource limits across users, teams, and projects. It enforces fair resource distribution, preventing a single entity from monopolizing cluster resources and thereby improving overall computational efficiency.

Because of budget constraints, customers might want to allocate compute resources across multiple teams fairly. For example, a data scientist might need some GPUs (for example, four H100 GPUs) for model development, but not the entire instance’s compute capacity. In other cases, customers have limited compute resources but many teams, and they want to fairly share compute resources across these teams, so that no idle capacity is left unused.

With HyperPod task governance, administrators can now allocate granular GPU, vCPU, and vCPU memory to teams and projects—in addition to the entire instance resources—based on their preferred strategy. Key capabilities include GPU-level quota allocation by instance type and family, or hardware type—supporting both Trainium and NVIDIA GPUs—and optional CPU and memory allocation for fine-tuned resource control. Administrators can also define the weight (or priority level) a team is given for fair-share idle compute allocation.

“With a wide variety of frontier AI data experiments and production pipelines, being able to maximize SageMaker HyperPod Cluster utilization is extremely high impact. This requires fair and controlled access to shared resources like state-of-the-art GPUs, granular hardware allocation, and more. This is exactly what HyperPod task governance is built for, and we’re excited to see AWS pushing efficient cluster utilization for a variety of AI use cases.”

– Daniel Xu, Director of Product at Snorkel AI, whose AI data technology platform empowers enterprises to build specialized AI applications by leveraging their organizational expertise at scale.

In this post, we dive deep into how to define quotas for teams or projects based on granular or instance-level allocation. We discuss different methods to define such policies, and how data scientists can schedule their jobs seamlessly with this new capability.

Solution overview

Prerequisites

To follow the examples in this blog post, you need to meet the following prerequisites:

To schedule and execute the example jobs in the Submitting Tasks section, you will also need:

Allocating granular compute and memory quota using the AWS console

Administrators are the primary persona interacting with SageMaker HyperPod task governance and are responsible for managing cluster compute allocation in alignment with the organization’s strategic priorities and goals.

Implementing this feature follows the familiar compute allocation creation workflow of HyperPod task governance. To get started, sign in to the AWS Management Console and navigate to Cluster Management under HyperPod Clusters in the Amazon SageMaker AI console. After selecting your HyperPod cluster, select the Policies tab in the cluster detail page. Navigate to Compute allocations and choose Create.

As with existing functionality, you can enable task prioritization and fair-share resource allocation through cluster policies that prioritize critical workloads and distribute idle compute across teams. With HyperPod task governance, you can define queue admission policies (first-come-first-served by default, or task ranking) and idle compute allocation methods (first-come-first-served by default, or fair-share). In the Compute allocation section, you can create and edit allocations to distribute resources among teams, enable lending and borrowing of idle compute, configure preemption of low-priority tasks, and assign fair-share weights.
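The same cluster policy can also be managed programmatically. The following is a minimal sketch, assuming the create-cluster-scheduler-config AWS CLI command and a scheduler-config shorthand with PriorityClasses and FairShare fields; the policy name, class names, and weights are illustrative, so verify the exact parameter shape against the current CLI reference:

# Sketch (assumed CLI shape): define task ranking classes and enable fair-share idle compute allocation
aws sagemaker create-cluster-scheduler-config \
    --region <aws_region> \
    --name "example-cluster-policy" \
    --cluster-arn "arn:aws:sagemaker:<aws_region>:<account_id>:cluster/<cluster_id>" \
    --scheduler-config "PriorityClasses=[{Name=high-priority,Weight=10},{Name=low-priority,Weight=1}],FairShare=Enabled"

The priority class names defined here (for example, high-priority) are what data scientists reference later when submitting tasks.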

The key innovation is in the Allocations section shown in the following figure, where you'll now find fine-grained options for resource allocation. In addition to the existing instance-level quotas, you can now directly specify GPU quotas by instance type and family or by hardware type. When you define GPU allocations, HyperPod task governance intelligently calculates appropriate default values for vCPUs and memory, which are set proportionally.

For example, when allocating 2 GPUs from a single p5.48xlarge instance (which has 8 GPUs, 192 vCPUs, and 2 TiB memory) in your HyperPod cluster, HyperPod task governance assigns 48 vCPUs and 512 GiB memory as default values—which is equivalent to one quarter of the instance’s total resources. Similarly, if your HyperPod cluster contains 2 ml.g5.2xlarge instances (each with 1 GPU, 8 vCPUs, and 32 GiB memory), allocating 2 GPUs would automatically assign 16 vCPUs and 64 GiB memory from both instances as shown in the following image.

You can either proceed with these automatically calculated default values or customize the allocation by manually adjusting the vCPUs and vCPU memory fields as seen in the following image.
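To make the proportional defaults explicit, here is a small shell sketch that reproduces the arithmetic from the p5.48xlarge example above (a pure illustration of the calculation, not an API call):

# Proportional defaults when allocating 2 of the 8 GPUs on a p5.48xlarge (192 vCPUs, 2048 GiB memory)
TOTAL_GPUS=8; TOTAL_VCPUS=192; TOTAL_MEM_GIB=2048
ALLOCATED_GPUS=2
echo "Default vCPUs:  $(( TOTAL_VCPUS * ALLOCATED_GPUS / TOTAL_GPUS ))"          # prints 48
echo "Default memory: $(( TOTAL_MEM_GIB * ALLOCATED_GPUS / TOTAL_GPUS )) GiB"    # prints 512 GiB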

Amazon SageMaker HyperPod supports clusters that include CPU-based instances, GPU-based instances, and AWS Neuron-based hardware (AWS Inferentia and AWS Trainium chips). You can specify resource allocation for your team by instances, GPUs, vCPUs, vCPU memory, or Neuron devices, as shown in the following image.

Quota allocation can exceed the current cluster capacity. Resources added to the compute allocation policy that aren't currently available in the cluster represent planning for future capacity upgrades. Jobs that require these unprovisioned resources are automatically queued and remain in a pending state until the necessary resources become available. It's important to understand that in SageMaker HyperPod, compute allocations function as quotas, which are checked during workload scheduling to decide whether a workload should be admitted, regardless of actual capacity availability. When resource requests fall within the defined allocation limits and current utilization allows, the Kubernetes scheduler (kube-scheduler) handles the actual distribution and placement of pods across the HyperPod cluster nodes.
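To compare the quotas you configure against the capacity that is actually provisioned, you can inspect the cluster nodes with standard kubectl. The commands below are a sketch; the node name is a placeholder:

# List nodes together with their instance type label
kubectl get nodes -L node.kubernetes.io/instance-type

# Inspect the allocatable GPUs, vCPUs, and memory on a specific node
kubectl describe node <node-name> | grep -A 10 "Allocatable"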

Allocating granular compute and memory quota using AWS CLI

You can also create or update compute quotas using the AWS CLI. The following is an example for creating a compute quota with only GPU count specification using the AWS CLI:

aws sagemaker \create-compute-quota \--region <aws_region> \--name "only-gpu-quota" \--cluster-arn "arn:aws:sagemaker: <aws_region>:<account_id>:cluster/<cluster_id>" \--description "test description" \--compute-quota-config "ComputeQuotaResources=[{InstanceType=ml.g6.12xlarge,Accelerators=2}],ResourceSharingConfig={Strategy=LendAndBorrow,BorrowLimit=10}" \--activation-state "Enabled" \--compute-quota-target "TeamName=onlygputeam2,FairShareWeight=10" 

Compute quotas can also be created with mixed quota types, including a certain number of instances and granular compute resources, as shown in the following example:

aws sagemaker \create-compute-quota \--region <aws_region> \--name "mix-quota-type" \--cluster-arn "arn:aws:sagemaker:<aws_region>:<account_id>:cluster/<cluster_id>" \--description "Mixed quota allocation" \--compute-quota-config "ComputeQuotaResources=[{InstanceType=ml.g6.12xlarge,Accelerators=2}, {InstanceType=ml.p5.48xlarge,Count=3}, {InstanceType=ml.c5.2xlarge,VCpu=2}],ResourceSharingConfig={Strategy=LendAndBorrow,BorrowLimit=10}" \--activation-state "Enabled" \--compute-quota-target "TeamName=mixquotatype,FairShareWeight=10" 
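To verify the quotas you have created, you can also list them from the CLI. This sketch assumes a list-compute-quotas command with a --cluster-arn filter, in line with the create and describe commands used in this post; check the AWS CLI reference for the exact options:

# Sketch (assumed command shape): list compute quotas defined for a HyperPod cluster
aws sagemaker list-compute-quotas \
    --region <aws_region> \
    --cluster-arn "arn:aws:sagemaker:<aws_region>:<account_id>:cluster/<cluster_id>"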

HyperPod task governance deep dive

SageMaker HyperPod task governance enables allocation of GPU, CPU, and memory resources by integrating with Kueue, a Kubernetes-native system for job queueing.

Kueue doesn’t replace existing Kubernetes scheduling components, but rather integrates with the kube-scheduler, such that Kueue decides whether a workload should be admitted based on the resource quotas and current utilization, and then the kube-scheduler takes care of pod placement on the nodes.

When a workload requests specific resources, Kueue selects an appropriate resource flavor based on availability, node affinity, and job priority. The scheduler then injects the corresponding node labels and tolerations into the PodSpec, allowing Kubernetes to place the pod on nodes with the requested hardware configuration. This supports precise resource governance and efficient allocation for multi-tenant clusters.
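You can observe this injection on an admitted workload by inspecting its pod with standard kubectl. The namespace and pod name below are placeholders following this post's examples:

# Node selector injected into the admitted pod
kubectl -n hyperpod-ns-team1 get pod <pod-name> -o jsonpath='{.spec.nodeSelector}'

# Tolerations on the same pod
kubectl -n hyperpod-ns-team1 get pod <pod-name> -o jsonpath='{.spec.tolerations}'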

When a SageMaker HyperPod task governance compute allocation is created, Kueue creates ClusterQueues that define resource quotas and scheduling policies, along with ResourceFlavors for the selected instance types with their unique resource characteristics.
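Before walking through a concrete example, note that you can list these Kueue objects directly with kubectl using the standard Kueue resource kinds; the generated names follow the hyperpod-ns-<team> pattern shown below:

# ClusterQueues and ResourceFlavors generated from compute allocations
kubectl get clusterqueues
kubectl get resourceflavors

# Namespace-scoped LocalQueues that workloads are submitted to
kubectl get localqueues --all-namespaces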

For example, the following compute allocation policy allocates ml.g6.12xlarge instances with 2 GPUs and 48 vCPUs to the onlygputeam team, implementing a LendAndBorrow strategy with a borrowing limit of up to 50%. This configuration enables flexible resource sharing while maintaining priority through a fair-share weight of 10 and the ability to preempt lower-priority tasks from other teams.

aws sagemaker describe-compute-quota \
    --region <aws_region> \
    --compute-quota-id <compute_quota_id>

# output
{
    "ComputeQuotaArn": "arn:aws:sagemaker:<aws_region>:<account_id>:compute-quota/<compute_quota_id>",
    "ComputeQuotaId": "<compute_quota_id>",
    "Name": "only-gpu-quota",
    "Description": "Only GPU quota allocation",
    "ComputeQuotaVersion": 1,
    "Status": "Created",
    "ClusterArn": "arn:aws:sagemaker:<aws_region>:<account_id>:cluster/<cluster_id>",
    "ComputeQuotaConfig": {
        "ComputeQuotaResources": [
            {
                "InstanceType": "ml.g6.12xlarge",
                "Accelerators": 2,
                "VCpu": 48.0
            }
        ],
        "ResourceSharingConfig": {
            "Strategy": "LendAndBorrow",
            "BorrowLimit": 50
        },
        "PreemptTeamTasks": "LowerPriority"
    },
    "ComputeQuotaTarget": {
        "TeamName": "onlygputeam",
        "FairShareWeight": 10
    },
    "ActivationState": "Enabled",
    "CreationTime": "2025-07-24T11:12:12.021000-07:00",
    "CreatedBy": {},
    "LastModifiedTime": "2025-07-24T11:15:45.205000-07:00",
    "LastModifiedBy": {}
}

The corresponding Kueue ClusterQueue is configured with the ml.g6.12xlarge flavor, providing quotas for 2 NVIDIA GPUs, 48 CPU cores, and 192 Gi memory.

kubectl describe clusterqueue hyperpod-ns-onlygputeam-clusterqueue

# output
Name:         hyperpod-ns-onlygputeam-clusterqueue
Namespace:
Labels:       sagemaker.amazonaws.com/quota-allocation-id=onlygputeam
              sagemaker.amazonaws.com/sagemaker-managed-queue=true
Annotations:  <none>
API Version:  kueue.x-k8s.io/v1beta1
Kind:         ClusterQueue
Metadata:
  ...
Spec:
  Cohort:  shared-pool
  Fair Sharing:
    Weight:  10
  Flavor Fungibility:
    When Can Borrow:   TryNextFlavor
    When Can Preempt:  TryNextFlavor
  Namespace Selector:
    Match Labels:
      kubernetes.io/metadata.name:  hyperpod-ns-onlygputeam
  Preemption:
    Borrow Within Cohort:
      Policy:               LowerPriority
    Reclaim Within Cohort:  Any
    Within Cluster Queue:   LowerPriority
  Queueing Strategy:        BestEffortFIFO
  Resource Groups:
    Covered Resources:
      nvidia.com/gpu
      aws.amazon.com/neurondevice
      cpu
      memory
      vpc.amazonaws.com/efa
    Flavors:
      Name:  ml.g6.12xlarge
      Resources:
        Borrowing Limit:  1
        Name:             nvidia.com/gpu
        Nominal Quota:    2
        Borrowing Limit:  0
        Name:             aws.amazon.com/neurondevice
        Nominal Quota:    0
        Borrowing Limit:  24
        Name:             cpu
        Nominal Quota:    48
        Borrowing Limit:  96Gi
        Name:             memory
        Nominal Quota:    192Gi
        Borrowing Limit:  0
        Name:             vpc.amazonaws.com/efa
        Nominal Quota:    1
  ...

A Kueue LocalQueue is also created, referencing the corresponding ClusterQueue. The LocalQueue acts as the namespace-scoped resource through which users submit workloads, and these workloads are then admitted and scheduled according to the quotas and policies defined in the ClusterQueue.

kubectl describe localqueue hyperpod-ns-onlygputeam-localqueue -n hyperpod-ns-onlygputeam

# output
Name:         hyperpod-ns-onlygputeam-localqueue
Namespace:    hyperpod-ns-onlygputeam
Labels:       sagemaker.amazonaws.com/quota-allocation-id=onlygputeam
              sagemaker.amazonaws.com/sagemaker-managed-queue=true
Annotations:  <none>
API Version:  kueue.x-k8s.io/v1beta1
Kind:         LocalQueue
Metadata:
  ...
Spec:
  Cluster Queue:  hyperpod-ns-onlygputeam-clusterqueue
  Stop Policy:    None
Status:
  Admitted Workloads:  0

Submitting tasks

There are two ways to submit tasks on Amazon EKS orchestrated SageMaker HyperPod clusters: the SageMaker HyperPod CLI and the Kubernetes command-line tool, kubectl. With both options, data scientists need to reference their team’s namespace and task priority class—in addition to the requested GPU and vCPU compute and memory resources—to use their granular allocated quota with appropriate prioritization. If the user doesn’t specify a priority class, then SageMaker HyperPod task governance will automatically assume the lowest priority. The specific GPU type is determined by the instance type selection, because data scientists might want GPUs with certain capabilities (for example, H100 instead of H200) to perform their tasks efficiently.
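To see which priority class names are available to reference at submission time, you can list the Kueue WorkloadPriorityClass objects that task governance creates from the cluster policy. This is a sketch; the high-priority class used in this post's examples is one such class:

# Priority classes defined by the cluster policy and referenced at submission time
kubectl get workloadpriorityclasses.kueue.x-k8s.io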

HyperPod CLI

The HyperPod CLI was created to abstract the complexities of working with kubectl, so that developers using SageMaker HyperPod can iterate faster with custom commands. The following is an example of a job submission with the HyperPod CLI requesting both compute and memory resources:

hyp create hyp-pytorch-job \
    --job-name sample-job1 \
    --image <account_id>.dkr.ecr.<aws_region>.amazonaws.com/<image_name>:<tag> \
    --pull-policy "Always" \
    --tasks-per-node 1 \
    --max-retry 1 \
    --priority high-priority \
    --namespace hyperpod-ns-team1 \
    --queue-name hyperpod-ns-team1-localqueue \
    --instance-type ml.g5.8xlarge \
    --accelerators 1 \
    --vcpu 4 \
    --memory 1 \
    --accelerators-limit 1 \
    --vcpu-limit 5 \
    --memory-limit 2

The accelerators, vcpu, and memory parameters, together with their corresponding limit flags, enable requesting granular compute and memory resources. The HyperPod CLI requires the HyperPod Training Operator to be installed in the cluster and a container image that includes the HyperPod Elastic Agent. For instructions on how to build such a container image, refer to the HyperPod Training Operator documentation.

For more information on the supported HyperPod CLI arguments and related description, see the SageMaker HyperPod CLI reference documentation.

Kubectl

The following is an example of a Kubernetes Job manifest that can be submitted to the HyperPod cluster with kubectl using the specified queue. It is a simple PyTorch job that checks for GPU availability and then sleeps briefly. Compute and memory resources are requested using the standard Kubernetes resource management constructs.

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job
  namespace: hyperpod-ns-team1
spec:
  parallelism: 1
  completions: 1
  suspend: true
  template:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team1-localqueue
        kueue.x-k8s.io/priority-class: high-priority
    spec:
      containers:
      - name: training-container
        image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
        command:
        - "python"
        - "-c"
        - "import torch; print('GPU available:', torch.cuda.is_available()); import time; time.sleep(15)"
        resources:
          requests:
            nvidia.com/gpu: 1
            cpu: "4"
            memory: "1Gi"
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never
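Assuming the manifest above is saved locally (the gpu-training-job.yaml filename below is a placeholder), you can submit it and follow its admission through Kueue with standard commands:

# Submit the suspended Job; Kueue unsuspends it once the team's quota admits the workload
kubectl apply -f gpu-training-job.yaml

# Follow the Job and its corresponding Kueue workload in the team namespace
kubectl -n hyperpod-ns-team1 get jobs
kubectl -n hyperpod-ns-team1 get workloads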

Sample commands

The following is a short reference guide of helpful commands for interacting with SageMaker HyperPod task governance:

kubectl get workloads -n hyperpod-ns-<team-name>
kubectl describe workload <workload-name> -n hyperpod-ns-<team-name>

These commands show the workloads running in your cluster per namespace and provide detailed reasoning on Kueue admission decisions. You can use them to answer questions such as “Why was my task preempted?” or “Why did my task get admitted?”

Common scenarios

A common use case for more granular allocation of GPU compute is fine-tuning small and medium-sized large language models (LLMs). A single H100 or H200 GPU might be sufficient for such a use case (depending also on the chosen batch size and other factors), and machine learning (ML) platform administrators can choose to allocate a single GPU to each data scientist or ML researcher to optimize the utilization of an instance like ml.p5.48xlarge, which comes with 8 H100 GPUs onboard.
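As an illustration of this pattern, an administrator could carve out a single-GPU quota for an individual researcher on a P5 instance. The following sketch reuses the create-compute-quota syntax shown earlier; the quota name, team name, sharing settings, and fair-share weight are illustrative:

# Sketch: allocate a single H100 GPU (and proportional default vCPU and memory) to one researcher's team
aws sagemaker create-compute-quota \
    --region <aws_region> \
    --name "researcher-a-single-gpu" \
    --cluster-arn "arn:aws:sagemaker:<aws_region>:<account_id>:cluster/<cluster_id>" \
    --description "One H100 GPU for SLM fine-tuning experiments" \
    --compute-quota-config "ComputeQuotaResources=[{InstanceType=ml.p5.48xlarge,Accelerators=1}],ResourceSharingConfig={Strategy=LendAndBorrow,BorrowLimit=10}" \
    --activation-state "Enabled" \
    --compute-quota-target "TeamName=researcher-a,FairShareWeight=5"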

Small language models (SLMs) have emerged as a significant advancement in generative AI, offering lower latency, decreased deployment costs, and enhanced privacy capabilities while maintaining impressive performance on targeted tasks, making them increasingly vital for agentic workflows and edge computing scenarios. The new SageMaker HyperPod task governance with fine-grained GPU, CPU, and memory allocation significantly enhances SLM development by enabling precise matching of resources to model requirements, allowing teams to efficiently run multiple experiments concurrently with different architectures. This resource optimization is particularly valuable as organizations develop specialized SLMs for domain-specific applications, with priority-based scheduling so that critical model training jobs receive resources first while maximizing overall cluster utilization. By providing exactly the right resources at the right time, HyperPod accelerates the development of specialized, domain-specific SLMs that can be deployed as efficient agents in complex workflows, enabling more responsive and cost-effective AI solutions across industries.

With the growing popularity of SLMs, organizations can use granular quota allocation to create targeted quota policies that prioritize GPU resources, addressing the budget-sensitive nature of ML infrastructure where GPUs represent the most significant cost and performance factor. Organizations can now selectively apply CPU and memory limits where needed, creating a granular resource management approach that efficiently supports diverse machine learning workloads regardless of model size.

Similarly, for inference workloads, teams might not require an entire instance to deploy their models; granular quotas help avoid allocating entire multi-GPU instances to each team and leaving GPU compute sitting idle.

Finally, during experimentation and algorithm development, data scientists and ML researchers can choose to deploy a container hosting their preferred IDE on HyperPod, like JupyterLab or Code-OSS (Visual Studio Code open source). In this scenario, they often experiment with smaller batch sizes before scaling to multi-GPU configurations, hence not needing entire multi-GPU instances to be allocated. Similar considerations apply to CPU instances; for example, an ML platform administrator might decide to use CPU instances for IDE deployment, because data scientists prefer to scale their training or fine-tuning with jobs rather than experimenting with the local IDE compute. In such cases, depending on the instances of choice, partitioning CPU cores across the team might be beneficial.

Conclusion

The introduction of fine-grained compute quota allocation in SageMaker HyperPod represents a significant advancement in ML infrastructure management. By enabling GPU-level resource allocation alongside instance-level controls, organizations can now precisely tailor their compute resources to match their specific workloads and team structures.

This granular approach to resource governance addresses critical challenges faced by ML teams today, balancing budget constraints, maximizing expensive GPU utilization, and ensuring fair access across data science teams of all sizes. Whether fine-tuning SLMs that require single GPUs, running inference workloads with varied resource needs, or supporting development environments that don’t require full instance power, this flexible capability helps ensure that no compute resources sit idle unnecessarily.

ML workloads continue to diversify in their resource requirements and SageMaker HyperPod task governance now provides the adaptability organizations need to optimize their GPU capacity investments. To learn more, visit the SageMaker HyperPod product page and HyperPod task governance documentation.

Give this a try in the Amazon SageMaker AI console and leave your comments here.


About the authors

Siamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience automating processes and deploying various technologies.

Zhenshan Jin is a Senior Software Engineer at Amazon Web Services (AWS), where he leads software development for task governance on SageMaker HyperPod. In his role, he focuses on empowering customers with advanced AI capabilities while fostering an environment that maximizes engineering team efficiency and productivity.

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Sindhura Palakodety is a Solutions Architect at AWS. She is passionate about helping customers build enterprise-scale Well-Architected solutions on the AWS platform and specializes in the data analytics domain.
