cs.AI updates on arXiv.org 10月03日
Kant:AI集群高效调度平台
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文介绍并评估了Kant系统,一个针对大规模AI容器集群设计的调度平台,支持训练和推理作业的协同调度。通过关键性能指标评估,Kant在百到万GPU的集群中表现出色,通过调度策略优化资源利用和效率。

arXiv:2510.01256v1 Announce Type: cross Abstract: As AI cluster sizes continue to expand and the demand for large-language-model (LLM) training and inference workloads grows rapidly, traditional scheduling systems face significant challenges in balancing resource utilization, scheduling efficiency, and service quality. This paper presents and evaluates Kant: an efficient unified scheduling platform designed for large-scale AI container clusters, supporting the co-scheduling of both training and inference jobs. Based on the practical implementation of the Kant system, we systematically define a set of key evaluation metrics for AI clusters, including GPU Allocation Ratio (GAR), Scheduling Occupancy Rate (SOR), GPU Node Fragmentation Ratio (GFR), Job Waiting Time Distribution (JWTD), and Job Training Time Estimation Distribution (JTTED), providing a foundation for quantitative performance analysis. Experimental results demonstrate that Kant achieves exceptional performance in clusters ranging from hundreds to tens of thousands of GPUs. By leveraging scheduling strategies such as Backfill and Enhanced Binpack (E-Binpack), the system significantly improves resource utilization and scheduling efficiency, while effectively reducing resource fragmentation and communication overhead in distributed training. The system has been deployed in multiple AI data center clusters, where it stably supports large-scale intelligent computing workloads. This work provides a practical engineering approach for building high-performance, highly available, AI-native scheduling infrastructure.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI集群 调度平台 资源优化 性能评估 Kant系统
相关文章