NVIDIA Developer · October 3, 22:27
NVIDIA KAI Scheduler Brings Intelligent Scheduling to Ray Clusters

NVIDIA KAI Scheduler is now natively integrated into KubeRay, bringing the powerful NVIDIA Run:ai scheduling engine to Ray clusters. The integration adds advanced capabilities such as gang scheduling, workload autoscaling, workload prioritization, and hierarchical queues, significantly improving resource utilization and responsiveness in Ray environments. By coordinating job starts, sharing GPUs efficiently, and prioritizing workloads intelligently, KAI Scheduler makes the infrastructure smarter. This article walks through using KAI Scheduler with Ray, including how to configure queues, gang-schedule distributed Ray workloads, and set priorities so that high-priority inference jobs can preempt lower-priority training jobs, keeping critical applications responsive.

✨ **Core KAI Scheduler features integrated**: NVIDIA KAI Scheduler is now natively integrated into KubeRay, bringing NVIDIA Run:ai's scheduling capabilities to Ray clusters, including gang scheduling, workload autoscaling, workload prioritization, and hierarchical queues. This lets Ray clusters manage and schedule resources more efficiently, especially in high-demand, large-scale environments.

🚀 **Smarter workload management**: KAI Scheduler improves infrastructure efficiency by coordinating job starts, optimizing GPU sharing, and prioritizing workloads intelligently. It ensures that distributed Ray workloads start as a whole (gang scheduling), avoiding the waste of partial startups; it scales cluster resources up and down with demand for elastic compute; and it lets high-priority inference jobs preempt lower-priority training jobs when resources are constrained, keeping critical applications responsive.

📊 **Hierarchical queues and priority policies**: The article shows how to set up hierarchical queues, for example a parent queue named `department-1` with a child queue named `team-a`, to mirror organizational structure and enable fine-grained resource control. The `priorityClassName` label distinguishes workloads of different priorities, such as `train` (training) and `inference` (inference), so that different workload types can coexist smoothly while critical services stay available.

💡 **Production deployment advice**: In production, model loading time is a critical factor. While the example loads a Hugging Face model directly, the recommended approach is NVIDIA Run:ai Model Streamer, which is natively integrated with vLLM and works seamlessly with Ray, streaming model weights from high-bandwidth storage such as S3 directly into GPU memory to significantly reduce model startup time.

NVIDIA KAI Scheduler is now natively integrated with KubeRay, bringing the same scheduling engine that powers high‑demand and high-scale environments in NVIDIA Run:ai directly into your Ray clusters.
This means you can now tap into gang scheduling, workload autoscaling, workload prioritization, hierarchical queues, and many more features in your Ray environment. Together, these capabilities make your infrastructure smarter by coordinating job starts, sharing GPUs efficiently, and prioritizing workloads. And all you have to do is configure it.

What this means for Ray users:

    Gang scheduling: no partial startups

    Distributed Ray workloads need all their workers and actors to start together—or not at all. KAI ensures they launch as a coordinated gang, preventing wasteful partial allocations that stall training or inference pipelines.

    Workload and cluster autoscaling

    For workloads such as offline batch inference, Ray clusters can scale up as cluster resources become available or when queues permit over-quota usage. They can also scale down as demand decreases, providing elastic compute aligned with resource availability and workload needs without manual intervention.

    Workload priorities: smooth coexistence of different types of workloads

    High‑priority inference jobs can automatically preempt lower‑priority batch training if resources are limited, keeping your applications responsive without manual intervention.

    Hierarchical queuing with priorities: dynamic resource sharing

    Create queues for different project teams with clear priorities so that when capacity is available, the higher‑priority queue can borrow idle resources from other teams.

In this post, we’ll walk through a hands-on example of how KAI enables smarter resource allocation and responsiveness for Ray—particularly in clusters where training and online inference must coexist. You’ll see how to:

    Schedule distributed Ray workers as a gang, ensuring coordinated startup.
    Leverage priority-based scheduling, where inference jobs preempt lower-priority training jobs.

The result is a tightly integrated execution stack, built from tools designed to work together, from scheduling policies to model serving.

Technical setup

This example assumes a Kubernetes cluster with GPU nodes, KAI Scheduler installed, and the KubeRay operator installed with KAI Scheduler set as its batch scheduler, which is done by passing the following value to the KubeRay operator Helm chart:

--set batchScheduler.name=kai-scheduler
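
A minimal sketch of that install, assuming the standard KubeRay Helm repository and a release named kuberay-operator:

# Add the KubeRay Helm repo and install the operator with KAI Scheduler as its batch scheduler.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator \
  --set batchScheduler.name=kai-scheduler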

Step 1: Set up KAI Scheduler queues

Before submitting Ray workloads, queues must be defined for the KAI Scheduler. KAI Scheduler supports hierarchical queuing, which enables teams and departments to be organized into multi-level structures with fine-grained control over resource distribution.

In this example, a simple two-level hierarchy will be created with a top-level parent queue called department-1 and a child queue called team-a. All workloads in this demo will be submitted through team-a, but in a real deployment, multiple departments and teams can be configured to reflect organizational boundaries.

apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: department-1
spec:
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a
spec:
  parentQueue: department-1
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1

A quick breakdown of the key parameters:

    Quota: The deserved share of resources to which a queue is entitled.
    Limit: The upper bound on how many resources a queue can consume.
    Over Quota Weight: Determines how surplus resources are distributed among queues that have the same priority. Queues with higher weights receive a larger portion of the extra capacity.

In this demo, no specific quotas, limits, or priorities are enforced. We’re keeping it simple to focus on the mechanics of integration. However, these fields provide powerful tools for enforcing fairness and managing contention across organizational boundaries.
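
For illustration only, here is a hedged sketch of what team-a could look like with explicit values; the numbers are hypothetical and not used in this demo:

apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a
spec:
  parentQueue: department-1
  resources:
    gpu:
      quota: 4            # deserved share of 4 GPUs (hypothetical)
      limit: 8            # hard cap of 8 GPUs (hypothetical)
      overQuotaWeight: 2  # gets a larger slice of idle capacity than a weight-1 queue
    # cpu and memory follow the same quota/limit/overQuotaWeight pattern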

To create the queues:

kubectl apply -f kai-scheduler-queue.yaml

With the queue hierarchy now in place, workloads can be submitted and scheduled under team-a.
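
To confirm the queues were created, list the Queue custom resources (using the fully qualified resource name avoids clashes with any other queue CRDs in the cluster):

kubectl get queues.scheduling.run.ai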

Step 2: Submit a training job with gang scheduling and workload prioritization

With the queues in place, it’s time to run a training workload using KAI’s gang scheduling.

In this example, we define a simple Ray cluster with one head node and a GPU worker group (one replica, with room to scale up to two).

KAI schedules all of the cluster's Kubernetes Pods (the head and the workers) as a gang, meaning they launch together or not at all, and if preemption occurs, they're stopped together too.

The only required configuration to enable KAI scheduling is the kai.scheduler/queue label, which assigns the job to a KAI queue—in this case, team-a.

An optional setting, priorityClassName: train, marks the job as a preemptible training workload. Here it is included to illustrate how KAI applies workload prioritization. For more information on workload priority, please refer to the official documentation.
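
If your KAI Scheduler installation represents these workload priorities as standard Kubernetes PriorityClass objects (a common setup; check your own deployment), you can see which priority classes exist in the cluster with:

kubectl get priorityclasses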

Here’s the manifest used in this demo:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample
  labels:
    kai.scheduler/queue: team-a
    priorityClassName: train
spec:
  headGroupSpec:
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.46.0
          resources:
            limits:
              cpu: 4
              memory: 15Gi
            requests:
              cpu: "1"
              memory: "2Gi"
  # ---- One Worker with a GPU ----
  workerGroupSpecs:
  - groupName: worker
    replicas: 1
    minReplicas: 1
    maxReplicas: 2
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.46.0
          resources:
            limits:
              cpu: 4
              memory: 15Gi
              nvidia.com/gpu: "1" # extended resources need a matching limit
            requests:
              cpu: "1"
              memory: "1Gi"
              nvidia.com/gpu: "1"

To apply the workload:

kubectl apply -f kai-scheduler-example.yaml

KAI then gang-schedules the Ray head and worker pods.
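
To watch the gang come up, you can filter the pods by the cluster label that KubeRay applies to the pods it creates:

kubectl get pods -l ray.io/cluster=raycluster-sample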

Step 3: Deploy an inference service with higher priority using vLLM

Now that we’ve submitted a training workload, let’s walk through how KAI Scheduler handles inference workloads, which are non-preemptible and higher priority by default. This distinction enables inference workloads to preempt lower-priority training jobs when GPU resources are limited, ensuring fast model responses for user-facing services.

In this example, we’ll:

    Deploy Qwen2.5-7B-Instruct using vLLM with Ray Serve and RayService.
    Submit the job to the same queue (team-a) as the training job.
    Use the label kai.scheduler/queue to enable KAI scheduling.
    Set the priorityClassName to inference to mark this as a high-priority workload.

Note: The only required label for scheduling with KAI is kai.scheduler/queue. The priorityClassName: inference used here is optional and specific to this demo to demonstrate workload preemption. Also, be sure to create a Kubernetes secret named 'hf-token' containing your Hugging Face token before applying the YAML.
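
Although the manifest below also embeds the Secret, an equivalent way to create it up front with kubectl (the key name hf_token must match the secretKeyRef in the worker spec) is:

kubectl create secret generic hf-token --from-literal=hf_token=$HF_TOKEN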

Here’s the manifest (make sure to add your own HF token in the Secret):

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ray-kai-scheduler-serve-llm
  labels:
    kai.scheduler/queue: team-a
    priorityClassName: inference
spec:
  serveConfigV2: |
    applications:
    - name: llms
      import_path: ray.serve.llm:build_openai_app
      route_prefix: "/"
      args:
        llm_configs:
        - model_loading_config:
            model_id: qwen2.5-7b-instruct
            model_source: Qwen/Qwen2.5-7B-Instruct
          engine_kwargs:
            dtype: bfloat16
            max_model_len: 1024
            device: auto
            gpu_memory_utilization: 0.75
          deployment_config:
            autoscaling_config:
              min_replicas: 1
              max_replicas: 1
              target_ongoing_requests: 64
            max_ongoing_requests: 128
  rayClusterConfig:
    rayVersion: "2.46.0"
    headGroupSpec:
      rayStartParams:
        num-cpus: "0"
        num-gpus: "0"
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray-llm:2.46.0-py311-cu124
            ports:
            - containerPort: 8000
              name: serve
              protocol: TCP
            - containerPort: 8080
              name: metrics
              protocol: TCP
            - containerPort: 6379
              name: gcs
              protocol: TCP
            - containerPort: 8265
              name: dashboard
              protocol: TCP
            - containerPort: 10001
              name: client
              protocol: TCP
            resources:
              limits:
                cpu: 4
                memory: 16Gi
              requests:
                cpu: 1
                memory: 4Gi
    workerGroupSpecs:
    - replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      groupName: gpu-group
      rayStartParams:
        num-gpus: "1"
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray-llm:2.46.0-py311-cu124
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: hf_token
            resources:
              limits:
                cpu: 4
                memory: 15Gi
                nvidia.com/gpu: "1"
              requests:
                cpu: 1
                memory: 15Gi
                nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:
  hf_token: $HF_TOKEN

Apply the workload:

kubectl apply -f ray-service.kai-scheduler.llm-serve.yaml

Loading the model and starting the vLLM engine will take some time here.
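
One way to follow the startup is to watch the RayService resource and the GPU worker's logs (replace the pod name placeholder with the real name from kubectl get pods):

kubectl get rayservice ray-kai-scheduler-serve-llm
kubectl logs -f <gpu-worker-pod-name>   # model download and vLLM engine startup are logged here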

Observe preemption in action

Once applied, you’ll notice that KAI Scheduler preempts the training job to make room for the inference workload, since both compete for the same GPU, but the inference workload has higher priority.

Example output from kubectl get pods:

$ kubectl get pods
NAME                                                     READY   STATUS    RESTARTS   AGE
ray-kai-scheduler-serve-llm-xxxx-gpu-group-worker-xxxx   1/1     Running   0          21m
ray-kai-scheduler-serve-llm-xxxx-head-xxxx               1/1     Running   0          21m
raycluster-sample-head-xxxx                              0/1     Running   0          21m
raycluster-sample-worker-worker-xxxx                     0/1     Running   0          21m
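
To see why the training pods lost their GPU, inspect the events on one of them (the pod name below is illustrative; copy the real one from the listing above):

kubectl describe pod raycluster-sample-worker-worker-xxxx
# Scheduler and eviction events for the pod appear in the Events section at the end of the output.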

For the sake of simplicity in this demo, we loaded the model directly from Hugging Face inside the container. This works for showcasing KAI Scheduler logic and preemption behavior. However, in real production environments, model loading time becomes a critical factor, especially when autoscaling inference replicas or recovering from eviction.

For that, we recommend using NVIDIA Run:ai Model Streamer, which is natively supported in vLLM and can be used out-of-the-box with Ray. The Ray documentation includes an example showing how to configure the Model Streamer in your Ray workloads.
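
As a rough, hypothetical sketch only (the exact keys are in the Ray and vLLM documentation referenced above), switching the serve config to the Model Streamer could look something like this, assuming your vLLM build supports load_format: runai_streamer and the weights are staged in an S3 bucket (bucket path invented for illustration):

        llm_configs:
        - model_loading_config:
            model_id: qwen2.5-7b-instruct
            model_source: s3://example-bucket/qwen2.5-7b-instruct/   # hypothetical S3 location
          engine_kwargs:
            load_format: runai_streamer   # stream weights straight into GPU memory
            dtype: bfloat16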

Interact with the deployed model

Before we port forward to access the Ray dashboard or the inference endpoint, let’s list the available services to ensure we target the correct one:

$ kubectl get svc
NAME                                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                         AGE
ray-kai-scheduler-serve-llm-head-svc         ClusterIP   None            xxxxxx        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   17m
ray-kai-scheduler-serve-llm-xxxxx-head-svc   ClusterIP   None            xxxxxx        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   24m
ray-kai-scheduler-serve-llm-serve-svc        ClusterIP   xx.xxx.xx.xxx   xxxxxx        8000/TCP                                        17m
raycluster-sample-head-svc                   ClusterIP   None            xxxxxx        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   32m

Now that we can see the service names, we’ll use:

    ray-kai-scheduler-serve-llm-xxxxx-head-svc to forward the Ray dashboard.
    ray-kai-scheduler-serve-llm-serve-svc to forward the model’s endpoint.

Then, port forward the Ray dashboard:

kubectl port-forward svc/ray-kai-scheduler-serve-llm-xxxxx-head-svc 8265:8265

Then open http://127.0.0.1:8265 to view the Ray dashboard and confirm the deployment is active.

Figure 1. Overview of the Ray dashboard
Figure 2. Overview of deployment logs

Next, port forward the LLM serve endpoint:

kubectl port-forward svc/ray-kai-scheduler-serve-llm-serve-svc 8000:8000

And finally, query the model using an OpenAI-compatible API call:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-7b-instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 13,
        "temperature": 0
      }'

Sample response:

{
  "id": "qwen2.5-7b-instruct-xxxxxx",
  "object": "text_completion",
  "created": 1753172931,
  "model": "qwen2.5-7b-instruct",
  "choices": [
    {
      "index": 0,
      "text": " city of neighborhoods, each with its own distinct character and charm.",
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 4,
    "completion_tokens": 13,
    "total_tokens": 17
  }
}
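
Because the app is built with build_openai_app, the OpenAI-compatible chat endpoint should also be exposed; for example (request shape follows the standard chat completions API):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-7b-instruct",
        "messages": [{"role": "user", "content": "Name three San Francisco neighborhoods."}],
        "max_tokens": 64,
        "temperature": 0
      }'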

Wrapping up

In this blog, we explored how KAI Scheduler brings advanced scheduling capabilities to Ray workloads, including gang scheduling and hierarchical queuing. We demonstrated how training and inference workloads can be efficiently prioritized, with inference workloads able to preempt training jobs when resources are limited.

While this demo used a simple open-weight model and Hugging Face for convenience, NVIDIA Run:ai Model Streamer is a production-grade option that reduces model spin-up times by streaming model weights directly from S3 or other high-bandwidth storage to GPU memory. It’s also natively integrated with vLLM and works out of the box with Ray, as shown in this example from Ray’s docs. We’re excited to see what the community builds with this stack. Happy scheduling. 

The KAI Scheduler team will be at KubeCon North America in Atlanta this November. Have questions about gang scheduling, workload auto-scaling, or AI workload optimization? Find us at our booth or sessions.

Get started with KAI Scheduler.
