AWS Machine Learning Blog

Generative AI Reshapes the Music Industry
Generative AI is rapidly reshaping the music industry, enabling creators to produce personalized music in real time. Splash Music partnered with AWS to develop a music generation foundation model, bringing professional music creation to millions of users. The HummingLM model combines a Transformer architecture with a music encoder to turn humming into high-fidelity music. AWS Trainium and SageMaker HyperPod accelerate model training and deployment, cutting costs by 54% and improving efficiency. The technology makes music creation easier and smarter, advancing personalized content generation.

🎵 HummingLM is Splash Music's proprietary generative model. Through a Transformer architecture and a music encoder, it turns humming into professional music without needing to learn timbre representations, enabling real-time personalized creation.

🚀 Splash Music uses AWS Trainium and SageMaker HyperPod to accelerate model training, cutting costs by 54%, improving efficiency, and enabling weekly model updates and rapid feature deployment.

🌐 The platform combines EKS orchestration, FSx for Lustre storage, and Trainium EC2 instances to build automated, resilient, and scalable training infrastructure, addressing the compute and storage challenges of large-scale model development.

🎧 HummingLM uses multi-stream audio processing and stem separation to extract six audio stems, including drums, bass, and vocals, producing high-quality training data, and applies mixed-precision training to optimize model performance and signal fidelity.

🔗 Through Lambda, SQS, and ECS integration, HummingLM runs efficient inference on AWS Inferentia, handling audio upload, processing, and mixing for a seamless music creation experience.

Generative AI is rapidly reshaping the music industry, empowering creators—regardless of skill—to create studio-quality tracks with foundation models (FMs) that personalize compositions in real time. As demand for unique, instantly generated content grows and creators seek smarter, faster tools, Splash Music collaborated with AWS to develop and scale music generation FMs, making professional music creation accessible to millions.

In this post, we show how Splash Music is setting a new standard for AI-powered music creation by using its advanced HummingLM model with AWS Trainium on Amazon SageMaker HyperPod. As a selected startup in the 2024 AWS Generative AI Accelerator, Splash Music collaborated closely with AWS Startups and the AWS Generative AI Innovation Center (GenAIIC) to fast-track innovation and accelerate their music generation FM development lifecycle.

Challenge: Scaling music generation

Splash Music has empowered a new generation of creators to make music, and has already driven over 600 million streams worldwide. By giving users tools that adapt to their evolving tastes and styles, the service makes music production accessible, fun, and relevant to how fans actually want to create. However, building the technology to unlock this creative freedom, especially the models that power it, meant overcoming a key challenge: the service needed a scalable, automated, and cost-effective infrastructure.

Overview of HummingLM: Splash Music’s foundation model

HummingLM is Splash Music's proprietary, multi-modal generative model, developed in close collaboration with the GenAIIC. It represents an advance in how AI can interpret and generate music. The model's architecture pairs a transformer-based large language model (LLM) with a specialized music encoder and upsampler.

The innovation lies in how HummingLM fuses these token streams. Using a transformer-based backbone, the model learns to blend the melodic intent from humming with the stylistic and structural cues from instrument sound (for example, to make the humming sound like a guitar, piano, flute, or different synthesized sound). Users can hum a tune, add an instrument control signal, and receive a fully arranged, high-fidelity track in return. HummingLM’s architecture is designed for both efficiency and expressiveness. By using discrete token representations, the model achieves faster convergence and reduced computational overhead compared to traditional waveform-based approaches. This makes it possible to train on diverse, large-scale datasets and adapt quickly to new genres or user preferences.
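
To make this fusion concrete, here is a minimal PyTorch sketch of one way discrete humming tokens and an instrument control token can be merged into a single conditioning sequence for a transformer decoder. This is an illustration under stated assumptions, not Splash Music's implementation: the class name, vocabulary sizes, and model dimensions are all hypothetical.

```python
import torch
import torch.nn as nn

class HummingFusionSketch(nn.Module):
    """Illustrative only: fuses humming tokens with an instrument control
    token and predicts audio codec tokens with a transformer decoder."""

    def __init__(self, hum_vocab=1024, audio_vocab=1024,
                 n_instruments=64, d_model=512):
        super().__init__()
        self.hum_embed = nn.Embedding(hum_vocab, d_model)        # melodic intent
        self.instr_embed = nn.Embedding(n_instruments, d_model)  # timbre/style cue
        self.audio_embed = nn.Embedding(audio_vocab, d_model)    # target tokens
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, audio_vocab)

    def forward(self, hum_tokens, instr_id, audio_tokens):
        # Conditioning: instrument token prepended to the hummed melody.
        cond = torch.cat([self.instr_embed(instr_id).unsqueeze(1),
                          self.hum_embed(hum_tokens)], dim=1)
        tgt = self.audio_embed(audio_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        return self.head(self.decoder(tgt, cond, tgt_mask=mask))

# Toy run: 2 clips, 50 humming tokens each, 100 target audio tokens.
model = HummingFusionSketch()
logits = model(torch.randint(0, 1024, (2, 50)),   # humming tokens
               torch.randint(0, 64, (2,)),        # instrument preset id
               torch.randint(0, 1024, (2, 100)))  # audio tokens so far
print(logits.shape)  # torch.Size([2, 100, 1024])
```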

The following diagram illustrates how HummingLM is trained and the inference process to generate high-quality music:

Solution overview: Accelerating model development with AWS Trainium on Amazon SageMaker HyperPod

Splash Music collaborated with the GenAIIC to advance its HummingLM foundation model, using the combined capabilities of Amazon SageMaker HyperPod and AWS Trainium chips for model training.

Splash Music's architecture follows SageMaker HyperPod best practices, using Amazon Elastic Kubernetes Service (Amazon EKS) as the orchestrator, Amazon FSx for Lustre to store over 2 PB of data, and AWS Trainium EC2 instances for acceleration. The following diagram illustrates the solution architecture.

In the following sections, we walk through each step of the model development lifecycle, from dataset preparation to compilation for optimized inference.

Dataset preparation

Efficient preparation and processing of large-scale audio datasets is critical for developing controllable music generation models.

In addition, the solution uses an advanced stem separation system that isolates songs into six distinct audio stems: drums, bass, vocals, lead, chordal, and other instruments.
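
The following sketch illustrates the general shape of such a preparation pipeline, assuming torchaudio for I/O and resampling. The separate_stems function is a hypothetical placeholder for the proprietary separator, and the 44.1 kHz target sample rate is an assumption.

```python
import torchaudio
import torchaudio.functional as F

# Stem names from the post; the target sample rate is an assumption.
STEMS = ["drums", "bass", "vocals", "lead", "chordal", "other"]
TARGET_SR = 44100

def separate_stems(waveform, sample_rate):
    # Placeholder for the proprietary separation model. Here we simply
    # return the full mix once per stem so the pipeline runs end to end.
    return [waveform for _ in STEMS]

def prepare_track(path: str, out_dir: str) -> None:
    waveform, sr = torchaudio.load(path)
    if sr != TARGET_SR:
        # Normalize every recording to one sample rate before training.
        waveform = F.resample(waveform, sr, TARGET_SR)
    for name, stem in zip(STEMS, separate_stems(waveform, TARGET_SR)):
        torchaudio.save(f"{out_dir}/{name}.wav", stem, TARGET_SR)
```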

By streamlining data handling from the outset, we make sure that the subsequent model training stages have access to clean, well-structured features.

Model architecture and optimization

HummingLM employs a dual-component architecture: a transformer-based LLM paired with a non-autoregressive upsampling component.

This division of labor is key to HummingLM's effectiveness: the LLM captures high-level musical intent, and the upsampling component handles acoustic details. Splash collaborated with the GenAIIC on research to optimize HummingLM along three dimensions:

- Flexible control signal design – The model accepts control signals of varying durations (1-5 seconds), a significant improvement over fixed-window approaches.
- Zero-shot capability – Unlike systems requiring explicit timbre embedding learning, HummingLM can generalize to unseen instrument presets without additional training.
- Non-autoregressive generation – The upsampling component uses parallel token prediction for significantly faster inference compared to traditional autoregressive approaches (see the sketch after this list).
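
The sketch below illustrates the non-autoregressive idea under stated assumptions: given the first-codebook tokens produced by the LLM, a bidirectional transformer predicts all remaining residual codebooks for every frame in a single parallel pass. The vocabulary size, codebook count, and class name are hypothetical.

```python
import torch
import torch.nn as nn

class ParallelUpsamplerSketch(nn.Module):
    """Non-autoregressive upsampling: given coarse (first-codebook) tokens,
    predict all remaining residual codebooks for every frame in one pass,
    rather than one token at a time."""

    def __init__(self, vocab=1024, n_codebooks=8, d_model=512):
        super().__init__()
        self.n_rest = n_codebooks - 1
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # no causal mask
        self.heads = nn.Linear(d_model, self.n_rest * vocab)

    def forward(self, coarse_tokens):
        h = self.encoder(self.embed(coarse_tokens))  # single parallel pass
        logits = self.heads(h)                       # (B, T, n_rest * vocab)
        B, T, _ = logits.shape
        return logits.view(B, T, self.n_rest, -1)    # per-codebook logits

coarse = torch.randint(0, 1024, (2, 200))  # first-codebook tokens from the LLM
print(ParallelUpsamplerSketch()(coarse).shape)  # torch.Size([2, 200, 7, 1024])
```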

Our evaluation demonstrated HummingLM's superior first codebook prediction capabilities – a critical factor in residual quantization systems, where the first codebook contains most of the acoustic information. The model consistently outperformed baseline approaches such as VALL-E across multiple quality metrics. The evaluation revealed several important findings.

Overall, HummingLM achieves state-of-the-art controllable music generation by significantly improving signal fidelity, generalizing well to unseen instruments, and delivering strong performance across diverse musical styles, boosted further by effective data augmentation strategies.
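
For readers unfamiliar with residual quantization, here is a minimal sketch of its structure, which shows why the first codebook matters most: stage 0 quantizes the signal itself, and each later stage only quantizes the leftover error. The random codebooks below are placeholders; real systems learn them.

```python
import torch

def rvq_encode(x, codebooks):
    """Residual VQ: stage 0 quantizes the signal itself (hence the first
    codebook is the most informative); each later stage only quantizes
    the error left over by the stages before it."""
    residual, codes = x, []
    for cb in codebooks:                                # cb: (K, dim)
        idx = torch.cdist(residual, cb).argmin(dim=-1)  # nearest codeword
        codes.append(idx)
        residual = residual - cb[idx]                   # pass on the leftover
    return codes

def rvq_decode(codes, codebooks, n_stages):
    # Summing only the first n_stages terms: codebook 0 alone already
    # yields the coarse reconstruction; later stages add refinement.
    return sum(cb[idx] for cb, idx in list(zip(codebooks, codes))[:n_stages])

frames = torch.randn(100, 32)                     # toy frame embeddings
books = [torch.randn(256, 32) for _ in range(4)]  # placeholder codebooks
codes = rvq_encode(frames, books)
coarse = rvq_decode(codes, books, n_stages=1)     # first codebook only
full = rvq_decode(codes, books, n_stages=4)       # all stages
```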

Efficient distributed training through parallelism, memory, and AWS Neuron optimization

Splash Music compiled and optimized its model for AWS Neuron, accelerating its model development lifecycle and deployment on AWS Trainium chips. The team designed for scalability, parallelization, and memory efficiency, building a system that supports models scaling from 2B to over 10B parameters.
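
As a simplified illustration of the sizing involved, the sketch below computes the data-parallel degree that remains once tensor- and pipeline-parallel degrees are fixed for a given model size. All degrees and core counts here are hypothetical, not Splash Music's actual configuration.

```python
# Hypothetical sizing helper: given the total number of accelerator
# cores, derive the data-parallel degree once tensor- and
# pipeline-parallel degrees are fixed for a model size.
CONFIGS = {
    "2B":  {"tensor": 8,  "pipeline": 1},   # illustrative degrees
    "10B": {"tensor": 32, "pipeline": 2},
}

def data_parallel_degree(total_cores: int, model: str) -> int:
    cfg = CONFIGS[model]
    model_parallel = cfg["tensor"] * cfg["pipeline"]  # cores per replica
    if total_cores % model_parallel:
        raise ValueError("cores must divide evenly across replicas")
    return total_cores // model_parallel

for size in CONFIGS:
    print(size, "->", data_parallel_degree(512, size), "data-parallel replicas")
```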

With optimizations at the Neuron level complete, the team turned to the orchestration layer. On SageMaker HyperPod, Splash Music developed a robust, Slurm-integrated pipeline that streamlines multi-node training, balances parallelism, and uses activation checkpointing for superior memory efficiency. The pipeline processes data through several critical stages.
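
Activation checkpointing, the memory technique named above, is straightforward to illustrate with PyTorch's built-in torch.utils.checkpoint: activations inside each wrapped block are discarded during the forward pass and recomputed during backward, trading compute for memory. The layer sizes in this sketch are arbitrary.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    """Each block's activations are dropped in the forward pass and
    recomputed during backward, cutting peak memory at some compute cost."""

    def __init__(self, n_layers: int = 24, d_model: int = 512):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # use_reentrant=False is the recommended non-reentrant path
            x = checkpoint(block, x, use_reentrant=False)
        return x

x = torch.randn(8, 512, requires_grad=True)
CheckpointedStack()(x).sum().backward()  # blocks are re-run here to get grads
```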

Model inference on AWS Inferentia with Amazon Elastic Container Service (Amazon ECS)

After training, the model is deployed on an Amazon Elastic Container Service (Amazon ECS) cluster backed by AWS Inferentia instances. User-submitted recordings, which often vary in quality, are uploaded to Amazon Simple Storage Service (Amazon S3), which absorbs the high volume of uploads. Each upload triggers an AWS Lambda function that queues the file in Amazon Simple Queue Service (Amazon SQS) for delivery to the ECS cluster where inference runs. On the cluster, HummingLM performs two key steps: stem separation to isolate and clean the vocals, and audio-to-melody conversion to extract the musical structure. Finally, a post-processing step recombines the cleaned vocals with backing tracks, producing the fully processed remixed audio.
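
A Lambda function of roughly this shape could implement the upload-to-queue step. This is a hypothetical sketch, not Splash Music's code; the QUEUE_URL environment variable and the message fields are assumed configuration details, while the event structure is the standard S3 notification format.

```python
import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["QUEUE_URL"]  # assumed configuration of the function

def handler(event, context):
    """Sketch of the S3-triggered step: enqueue one SQS job per new
    recording for the ECS inference service to consume."""
    records = event["Records"]  # standard S3 event notification shape
    for record in records:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({
                "bucket": record["s3"]["bucket"]["name"],
                "key": record["s3"]["object"]["key"],
            }),
        )
    return {"queued": len(records)}
```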

Results and impact

Splash Music's research and development teams now rely on a unified infrastructure built on Amazon SageMaker HyperPod and AWS Trainium chips. The solution has yielded significant benefits.

Splash achieved substantial throughput improvements over conventional architectures, enabling it to process expansive datasets and support the model's complex multimodal nature. The solution provides a robust foundation for future growth as data and models continue to scale.

“AWS Trainium and SageMaker HyperPod took the friction out of our workflow at Splash Music,” says Daniel Hatadi, Software Engineer, Splash Music. “We replaced brittle GPU clusters with automated, self-healing distributed training that scales seamlessly. Training times are nearly 50% faster, and training costs have dropped by 54%. By relying on AWS AI chips and SageMaker HyperPod and collaborating with the AWS Generative AI Innovation Center, we were able to focus on model design and music-specific research, instead of cluster maintenance. This collaboration has made it easier for us to iterate quickly, run more experiments, train larger models, and keep shipping improvements without needing a bigger team.”

Splash Music was also featured in the AWS Summit Sydney 2025 keynote.

Conclusion and next steps

Splash Music is redefining how creators bring their musical ideas to life, making it possible for anyone to generate fresh, personalized tracks that resonate with millions of listeners worldwide. To support this vision at scale, Splash built its HummingLM FM in close collaboration with AWS Startups and the GenAIIC, using services such as SageMaker HyperPod and AWS Trainium. These solutions provide the infrastructure and performance needed to keep pace, helping Splash to create even more intuitive and inspiring experiences for creators.

“With SageMaker HyperPod and Trainium, our researchers experiment as fast as our community creates,” says Randeep Bhatia, Chief Technology Officer, Splash Music. “We’re not just keeping up with music trends—we’re setting them.”

Looking forward, Splash Music plans to expand its training datasets tenfold, explore multimodal audio/video generation, and continue collaborating with the GenAIIC on further R&D and the next version of the HummingLM FM.

Try creating your own music using Splash Music, and learn more about Amazon SageMaker HyperPod and AWS Trainium.


About the authors

Sheldon Liu is a Senior Applied Scientist and ANZ Tech Lead at the AWS Generative AI Innovation Center. He partners with AWS customers across diverse industries to develop and implement innovative generative AI solutions, accelerating their AI adoption journey while driving significant business outcomes.

Mahsa Paknezhad is a Deep Learning Architect and a key member of the AWS Generative AI Innovation Center. She works closely with enterprise clients to design, implement, and optimize cutting-edge generative AI solutions. With a focus on scalability and production readiness, Mahsa helps organizations across diverse industries harness advanced generative AI models to achieve meaningful business outcomes.

Xiaoning Wang is a machine learning engineer at the AWS Generative AI Innovation Center. He specializes in large language model training and optimization on AWS Trainium and Inferentia, with experience in distributed training, RAG, and low-latency inference. He works with enterprise customers to build scalable generative AI solutions that drive real business impact.

Tianyu Liu is an applied scientist at the AWS Generative AI Innovation Center. He partners with enterprise customers to design, implement, and optimize cutting-edge generative AI models, advancing innovation and helping organizations achieve transformative results with scalable, production-ready AI solutions.

Xuefeng Liu leads a science team at the AWS Generative AI Innovation Center in the Asia Pacific regions. His team partners with AWS customers on generative AI projects, with the goal of accelerating customers’ adoption of generative AI.

Daniel Wirjo is a Solutions Architect at AWS, focused on AI and SaaS startups. As a former startup CTO, he enjoys collaborating with founders and engineering leaders to drive growth and innovation on AWS. Outside of work, Daniel enjoys taking walks with a coffee in hand, appreciating nature, and learning new ideas.
