VentureBeat, October 28, 00:05
Google Cloud launches Vertex AI Training to help enterprises build their own AI models

Google Cloud recently launched a new service called Vertex AI Training, aimed at helping enterprises play a bigger role in building and customizing their own large AI models. The service gives companies a managed Slurm environment, equipped with data science tooling and chips capable of large-scale model training. The move signals that Google Cloud hopes to draw more enterprises away from other cloud providers and toward building more company-specific AI models. Vertex AI Training not only lets customers customize Google's Gemini models, but also lets them bring in their own models or customize any open-source model Google hosts. The service positions Google Cloud directly against competitors such as CoreWeave, Lambda Labs, AWS, and Microsoft Azure, offering enterprises a more reliable, better-optimized compute environment, particularly for large projects that train models from scratch rather than simply fine-tuning.

🚀 **Managed Slurm environment and full-stack service**: Vertex AI Training provides a managed Slurm environment that addresses common training headaches around job scheduling, failure recovery, and time management. It automatically detects and restarts training jobs interrupted by hardware failures, and minimizes downtime through automatic checkpointing, improving training throughput and efficiency, especially on large compute clusters.

💡 **Broad chip support and model customization**: The service offers access to a range of chips capable of large-scale model training, and lets enterprises bring their own models or customize any open-source model Google hosts. This gives companies the flexibility to build deeply customized AI solutions tailored to their needs, rather than settling for simple fine-tuning or RAG (retrieval-augmented generation).

🎯 **An enterprise-grade solution for large-scale training**: Google Cloud states that Vertex AI Training is aimed at enterprises planning large-scale model training, not users doing only fine-tuning or LoRA (low-rank adaptation). The service focuses on long-running training jobs that may span hundreds or even thousands of chips, serving AI startups, technology companies, sovereign organizations, and large enterprises integrating AI into internal processes.

💰 **Tackling the cost and difficulty of GPU access**: Model training, especially from scratch, is expensive and fiercely competitive, above all when it comes to securing GPU resources. By providing access to costly GPUs along with full-stack services, Vertex AI Training helps enterprises clear this hurdle, making custom or from-scratch models feasible for more organizations without massive investments in buying and maintaining their own hardware.

Some enterprises are best served by fine-tuning large models to their needs, but a number of companies plan to build their own models, a project that would require access to GPUs. 

Google Cloud wants to play a bigger role in enterprises’ model-making journey with its new service, Vertex AI Training. The service gives enterprises looking to train their own models access to a managed Slurm environment, data science tooling and any chips capable of large-scale model training. 

With this new service, Google Cloud hopes to turn more enterprises away from other providers and encourage the building of more company-specific AI models. 

While Google Cloud has always offered the ability to customize its Gemini models, the new service allows customers to bring in their own models or customize any open-source model Google Cloud hosts. 

Vertex AI Training positions Google Cloud directly against companies like CoreWeave and Lambda Labs, as well as its cloud competitors AWS and Microsoft Azure.  

Jaime de Guerre, senior director of product management at Google Cloud, told VentureBeat that the company has been hearing from a lot of organizations of varying sizes that they need a way to better optimize compute, but in a more reliable environment.

“What we're seeing is that there's an increasing number of companies that are building or customizing large gen AI models to introduce a product offering built around those models, or to help power their business in some way,” de Guerre said. “This includes AI startups, technology companies, sovereign organizations building a model for a particular region or culture or language and some large enterprises that might be building it into internal processes.”

De Guerre noted that while anyone can technically use the service, Google is targeting companies planning large-scale model training rather than simple fine-tuning or LoRA adapters. Vertex AI Training will focus on longer-running training jobs spanning hundreds or even thousands of chips. Pricing will depend on the amount of compute the enterprise needs.

“Vertex AI Training is not for adding more information to the context or using RAG; this is to train a model where you might start from completely random weights,” he said.

Model customization on the rise

Enterprises are recognizing the value of building customized models beyond just fine-tuning an LLM via retrieval-augmented generation (RAG). Custom models would know more in-depth company information and respond with answers specific to the organization. Companies like Arcee.ai have begun offering their models for customization to clients. Adobe recently announced a new service that allows enterprises to retrain Firefly for their specific needs. Organizations like FICO, which create small language models specific to the finance industry, often buy GPUs to train them at significant cost. 

Google Cloud said Vertex AI Training differentiates itself by giving access to a larger set of chips, services to monitor and manage training and the expertise it learned from training the Gemini models. 

Some early customers of Vertex AI Training include AI Singapore, a consortium of Singaporean research institutes and startups that built the 27-billion-parameter SEA-LION v4, and Salesforce’s AI research team. 

Enterprises often have to choose between taking an already-built LLM and fine-tuning it or building their own model. But creating an LLM from scratch is usually unattainable for smaller companies, or it simply doesn’t make sense for some use cases. However, for organizations where a fully custom or from-scratch model makes sense, the issue is gaining access to the GPUs needed to run training.

Model training can be expensive

Training a model, de Guerre said, can be difficult and expensive, especially when organizations compete with several others for GPU space.

Hyperscalers like AWS and Microsoft — and, yes, Google — have pitched that their massive data centers and racks and racks of high-end chips deliver the most value to enterprises. Not only will they have access to expensive GPUs, but cloud providers often offer full-stack services to help enterprises move to production.

Services like CoreWeave gained prominence for offering on-demand access to Nvidia H100s, giving customers flexibility in compute power when building models or applications. This has also given rise to a business model in which companies with GPUs rent out server space.

De Guerre said Vertex AI Training isn't just about offering access to train models on bare compute, where the enterprise rents a GPU server but must bring its own training software and manage the timing and failures itself.

“This is a managed Slurm environment that will help with all the job scheduling and automatic recovery of jobs failing,” de Guerre said. “So if a training job slows down or stops due to a hardware failure, the training will automatically restart very quickly, based on automatic checkpointing that we do in management of the checkpoints to continue with very little downtime.”

He added that this provides higher throughput and more efficient training for a larger scale of compute clusters. 
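The checkpoint-and-restart pattern de Guerre describes can be sketched in a few lines. This is an illustrative toy, not any Google Cloud or Slurm API: the trainer saves its state periodically, and a requeued job resumes from the newest checkpoint instead of step zero, so a hardware failure loses at most one checkpoint interval of work. All names and the file layout here are hypothetical.

```python
# Minimal sketch of automatic checkpoint/resume, the pattern a managed
# Slurm environment automates across a cluster. Hypothetical names only.
import json
import os

CKPT_PATH = "checkpoint.json"

def save_checkpoint(step, state):
    # Write to a temp file, then rename atomically, so a crash mid-write
    # can never leave a corrupted checkpoint behind.
    tmp = CKPT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, CKPT_PATH)

def load_checkpoint():
    # A restarted job picks up from the latest checkpoint if one exists.
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {"loss": None}  # fresh start: step 0, untrained state

def train(total_steps=100, ckpt_every=10):
    step, state = load_checkpoint()
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step  # stand-in for a real training step
        if step % ckpt_every == 0:
            save_checkpoint(step, state)
    return step, state
```

In a real deployment the "state" would be model weights sharded across accelerators, and the scheduler (rather than the training script) would detect the failure and requeue the job; the managed service's pitch is that this machinery comes preconfigured.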

Services like Vertex AI Training could make it easier for enterprises to build niche models or completely customize existing models. Still, just because the option exists doesn’t mean it's the right fit for every enterprise. 
