AWS Machine Learning Blog, September 16
SageMaker HyperPod Optimizes AI Training Efficiency and Network Latency

Amazon SageMaker HyperPod introduces a new task governance capability designed to improve the training efficiency and network latency of AI workloads. SageMaker HyperPod task governance streamlines resource allocation and promotes efficient use of compute resources across teams and projects on Amazon EKS clusters. Administrators can govern accelerated compute allocation and enforce task priority policies, improving resource utilization. The article also details how to use EC2 network topology information to optimize the placement of generative AI workloads, reducing latency by minimizing network hops and routing traffic to nearby instances, which improves training efficiency. It provides detailed steps and examples for retrieving node topology information and submitting topology-aware tasks.

💡 **SageMaker HyperPod task governance improves efficiency and reduces latency**: By optimizing resource allocation and task priority policies, this capability significantly improves the training efficiency and compute resource utilization of AI workloads on Amazon EKS clusters while reducing network latency, letting organizations focus on innovation rather than resource coordination.

🌐 **Using EC2 network topology to optimize workload placement**: Generative AI workloads depend on frequent network communication between instances, and network latency directly affects training time and processing speed. By accounting for the physical and logical layout of instances within a data center and using EC2 network topology information, you can minimize network hops and route traffic to nearby instances, effectively reducing latency and improving training efficiency.

🛠️ **Implementing topology-aware scheduling**: The article explains how to retrieve node topology information (network node layers 1, 2, and 3) and provides a visualization script. Users can submit topology-aware tasks by modifying a Kubernetes manifest file to use the `kueue.x-k8s.io/podset-required-topology` or `kueue.x-k8s.io/podset-preferred-topology` annotation, or by passing the `--required-topology` or `--preferred-topology` parameter to the SageMaker HyperPod CLI, ensuring pods are scheduled onto nodes with optimal network connectivity.

⚙️ **Deployment prerequisites and workflow**: Topology-aware scheduling requires a set of prerequisites, including an EKS cluster, a SageMaker HyperPod cluster with topology information enabled, the SageMaker HyperPod task governance add-on (version 1.2.2 or higher), and kubectl. The article lays out the specific commands and steps for retrieving node topology information, visualizing the cluster topology, and submitting tasks.

Today, we are excited to announce a new capability of Amazon SageMaker HyperPod task governance to help you optimize training efficiency and network latency of your AI workloads. SageMaker HyperPod task governance streamlines resource allocation and facilitates efficient compute resource utilization across teams and projects on Amazon Elastic Kubernetes Service (Amazon EKS) clusters. Administrators can govern accelerated compute allocation and enforce task priority policies, improving resource utilization. This helps organizations focus on accelerating generative AI innovation and reducing time to market, rather than coordinating resource allocation and replanning tasks. Refer to Best practices for Amazon SageMaker HyperPod task governance for more information.

Generative AI workloads typically demand extensive network communication across Amazon Elastic Compute Cloud (Amazon EC2) instances, where network bandwidth impacts both workload runtime and processing latency. The network latency of these communications depends on the physical placement of instances within a data center’s hierarchical infrastructure. Data centers can be organized into nested organizational units such as network nodes and node sets, with multiple instances per network node and multiple network nodes per node set. For example, instances within the same organizational unit experience faster processing time compared to those across different units. This means fewer network hops between instances result in lower communication latency.

To optimize the placement of your generative AI workloads in your SageMaker HyperPod clusters by considering the physical and logical arrangement of resources, you can use EC2 network topology information during your job submissions. An EC2 instance’s topology is described by a set of nodes, with one node in each layer of the network. Refer to How Amazon EC2 instance topology works for details on how EC2 topology is arranged. Network topology labels surface this hierarchy as Kubernetes node labels, so the scheduler can place communicating pods on instances that are physically close to each other.
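As a quick way to see this hierarchy outside of Kubernetes, you can query the EC2 DescribeInstanceTopology API directly. The following is a minimal sketch; the instance ID is a placeholder:

aws ec2 describe-instance-topology --instance-ids i-0123456789abcdef0

The response includes a NetworkNodes list for the instance, ordered from the top layer of the network down to the bottom layer.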

With topology-aware scheduling for SageMaker HyperPod task governance, you can use network topology labels to schedule your jobs with optimized network communication, thereby improving task efficiency and resource utilization for your AI workloads.

In this post, we introduce topology-aware scheduling with SageMaker HyperPod task governance by submitting jobs that represent hierarchical network information. We provide details about how to use SageMaker HyperPod task governance to optimize your job efficiency.

Solution overview

Data scientists interact with SageMaker HyperPod clusters to train, fine-tune, and deploy models on accelerated compute instances. It’s important to make sure they have the necessary capacity and permissions when working with clusters of GPUs.

To implement topology-aware scheduling, you first confirm the topology information for all nodes in your cluster, then run a script that tells you which instances are on the same network nodes, and finally schedule a topology-aware training task on your cluster. This workflow facilitates higher visibility and control over the placement of your training instances.

In this post, we walk through viewing node topology information and submitting topology-aware tasks to your cluster. For reference, NetworkNodes describes the network node set of an instance. Each network node set has three layers that make up the hierarchical view of the topology for each instance. Instances that are closest to each other share the same layer 3 network node. If instances have no common network node in the bottom layer (layer 3), check whether they share one at layer 2.
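As an illustration (the node IDs below are hypothetical), two instances that share the same layer 1 and layer 2 network nodes but differ at layer 3 sit in the same part of the hierarchy without being co-located under the same bottom-layer network node:

# Instance A
topology.k8s.aws/network-node-layer-1: nn-11111example
topology.k8s.aws/network-node-layer-2: nn-22222example
topology.k8s.aws/network-node-layer-3: nn-33333example

# Instance B
topology.k8s.aws/network-node-layer-1: nn-11111example
topology.k8s.aws/network-node-layer-2: nn-22222example
topology.k8s.aws/network-node-layer-3: nn-44444example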

Prerequisites

To get started with topology-aware scheduling, you must have the following prerequisites:

- An Amazon EKS cluster
- A SageMaker HyperPod cluster with topology information enabled
- The SageMaker HyperPod task governance add-on installed (version 1.2.2 or higher)
- Kubectl installed

Get node topology information

Run the following command to show node labels in your cluster. This command provides network topology information for each instance.

kubectl get nodes -L topology.k8s.aws/network-node-layer-1
kubectl get nodes -L topology.k8s.aws/network-node-layer-2
kubectl get nodes -L topology.k8s.aws/network-node-layer-3

Instances with the same layer 3 network node are as close together as possible, following the EC2 topology hierarchy. You should see a list of node labels that look like the following:

topology.k8s.aws/network-node-layer-3: nn-33333example

Run the following script to show the nodes in your cluster that are on the same layer 1, 2, and 3 network nodes:

git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/1.architectures/7.sagemaker-hyperpod-eks/task-governance
chmod +x visualize_topology.sh
bash visualize_topology.sh

The output of this script is a flowchart definition that you can render in a diagram editor such as Mermaid.js.org to visualize the node topology of your cluster, for example the hierarchy of a seven-instance cluster.
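For reference, the generated definition looks along these lines (the node IDs and instance name here are illustrative, not output from a real cluster):

graph TB
  nn-11111example --> nn-22222example
  nn-22222example --> nn-33333example
  nn-33333example --> hyperpod-i-0123456789abcdef0

Each edge represents containment: a layer 1 network node contains layer 2 nodes, which contain layer 3 nodes, which in turn contain the cluster’s instances.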

Submit tasks

SageMaker HyperPod task governance offers two ways to submit tasks using topology awareness. In this section, we discuss these two options, as well as a third, alternative option outside of task governance.

Modify your Kubernetes manifest file

First, you can modify your existing Kubernetes manifest file to include one of two annotation options:

- kueue.x-k8s.io/podset-required-topology – the job’s pods must all be scheduled on nodes that share the same network node at the specified layer; otherwise, the job waits until such placement is possible
- kueue.x-k8s.io/podset-preferred-topology – the scheduler tries to place the job’s pods on nodes that share the same network node at the specified layer, but falls back to a less strict placement if that isn’t possible

The following code is an example of a sample job that uses the kueue.x-k8s.io/podset-required-topology setting to schedule pods that share the same layer 3 network node:

apiVersion: batch/v1
kind: Job
metadata:
  name: test-tas-job
  namespace: hyperpod-ns-team-a
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-a-localqueue
    kueue.x-k8s.io/priority-class: inference-priority
spec:
  parallelism: 10
  completions: 10
  suspend: true
  template:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team-a-localqueue
      annotations:
        kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
    spec:
      containers:
        - name: dummy-job
          image: public.ecr.aws/docker/library/alpine:latest
          command: ["sleep", "3600s"]
          resources:
            requests:
              cpu: "1"
      restartPolicy: Never
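To submit the job, apply the manifest with kubectl (the filename here is illustrative):

kubectl apply -f test-tas-job.yaml

Because the job is created with suspend: true, Kueue admits and unsuspends it only after the topology constraint can be satisfied.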

To verify which nodes your pods are running on, use the following command to view node IDs per pod:

kubectl get pods -n hyperpod-ns-team-a -o wide
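Because task governance builds on Kueue, you can also inspect the Kueue workload object to confirm the job was admitted (this assumes the standard Kueue CRDs installed with the task governance add-on):

kubectl get workloads -n hyperpod-ns-team-a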

Use the SageMaker HyperPod CLI

The second way to submit a job is through the SageMaker HyperPod CLI. Be sure to install the latest version (version pending) to use topology-aware scheduling. To use topology-aware scheduling with the SageMaker HyperPod CLI, you can include either the --preferred-topology parameter or the --required-topology parameter in your create job command.

The following code is an example command to start a topology-aware MNIST training job using the SageMaker HyperPod CLI. Replace XXXXXXXXXXXX with your AWS account ID:

hyp create hyp-pytorch-job \
  --job-name test-pytorch-job-cli \
  --image XXXXXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com/ptjob:mnist \
  --pull-policy "Always" \
  --tasks-per-node 1 \
  --max-retry 1 \
  --preferred-topology topology.k8s.aws/network-node-layer-3
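If the pods must share the same layer 3 network node for the job to start, swap the preferred flag for the required one; the rest of the command is unchanged:

hyp create hyp-pytorch-job \
  --job-name test-pytorch-job-cli \
  --image XXXXXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com/ptjob:mnist \
  --pull-policy "Always" \
  --tasks-per-node 1 \
  --max-retry 1 \
  --required-topology topology.k8s.aws/network-node-layer-3

With --preferred-topology, the job can still start when co-location isn’t available; with --required-topology, it waits until it is.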

Clean up

If you deployed new resources while following this post, refer to the Clean Up section in the SageMaker HyperPod EKS workshop to make sure you don’t accrue unwanted charges.

Conclusion

During large language model (LLM) training, pod-to-pod communication distributes the model across multiple instances, requiring frequent data exchange between these instances. In this post, we discussed how SageMaker HyperPod task governance helps schedule workloads to enable job efficiency by optimizing throughput and latency. We also walked through how to schedule jobs using SageMaker HyperPod topology network information to optimize network communication latency for your AI tasks.

We encourage you to try out this solution and share your feedback in the comments section.


About the authors

Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large scale distributed training and inference on AWS. Prior to her current role, she spent several years at AWS focused on helping emerging GenAI startups develop models from ideation to production.

Siamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience automating processes and deploying various technologies.

Zican Li is a Senior Software Engineer at Amazon Web Services (AWS), where he leads software development for Task Governance on SageMaker HyperPod. In his role, he focuses on empowering customers with advanced AI capabilities while fostering an environment that maximizes engineering team efficiency and productivity.

Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
