AWS Machine Learning Blog

Generative AI Reshapes the Music Industry
Generative AI is rapidly reshaping the music industry, enabling creators to produce personalized music in real time. Splash Music partnered with AWS to develop a music generation foundation model, bringing professional music creation to millions of users. The HummingLM model combines a Transformer architecture with a music encoder to turn humming into high-fidelity music. AWS Trainium and SageMaker HyperPod accelerate model training and deployment, cutting costs by 54% and improving efficiency. The technology makes music creation easier and smarter, advancing personalized content generation.

🎵 HummingLM is Splash Music's proprietary generative model. Through a Transformer architecture and a music encoder, it turns humming into professional music without needing to learn timbre representations, enabling real-time personalized creation.

🚀 Splash Music uses AWS Trainium and SageMaker HyperPod to accelerate model training, cutting costs by 54%, improving efficiency, and enabling weekly model updates and rapid feature deployment.

🌐 The platform combines EKS orchestration, FSx for Lustre storage, and Trainium EC2 instances to build automated, resilient, and scalable training infrastructure, addressing the compute and storage challenges of large-scale model development.

🎧 HummingLM uses multi-stream audio processing and stem separation to extract six audio stems, including drums, bass, and vocals, producing high-quality training data, and applies mixed-precision training to optimize model performance and signal fidelity.

🔗 Through Lambda, SQS, and ECS integration, HummingLM runs efficient inference on AWS Inferentia, handling audio upload, processing, and mixing for a seamless music creation experience.

Generative AI is rapidly reshaping the music industry, empowering creators—regardless of skill—to create studio-quality tracks with foundation models (FMs) that personalize compositions in real time. As demand for unique, instantly generated content grows and creators seek smarter, faster tools, Splash Music collaborated with AWS to develop and scale music generation FMs, making professional music creation accessible to millions.

In this post, we show how Splash Music is setting a new standard for AI-powered music creation by using its advanced HummingLM model with AWS Trainium on Amazon SageMaker HyperPod. As a selected startup in the 2024 AWS Generative AI Accelerator, Splash Music collaborated closely with AWS Startups and the AWS Generative AI Innovation Center (GenAIIC) to fast-track innovation and accelerate their music generation FM development lifecycle.

Challenge: Scaling music generation

Splash Music has empowered a new generation of creators to make music, and has already driven over 600 million streams worldwide. By giving users tools that adapt to their evolving tastes and styles, the service makes music production accessible, fun, and relevant to how fans actually want to create. However, building the technology to unlock this creative freedom, especially the models that power it, meant overcoming a key challenge: the service needed a scalable, automated, and cost-effective infrastructure.

Overview of HummingLM: Splash Music’s foundation model

HummingLM is Splash Music's proprietary, multi-modal generative model, developed in close collaboration with the GenAIIC. It represents an advance in how AI can interpret and generate music. The model's architecture pairs a transformer-based large language model (LLM) with a specialized music encoder and upsampler.

The innovation lies in how HummingLM fuses these token streams. Using a transformer-based backbone, the model learns to blend the melodic intent from humming with the stylistic and structural cues from instrument sound (for example, to make the humming sound like a guitar, piano, flute, or different synthesized sound). Users can hum a tune, add an instrument control signal, and receive a fully arranged, high-fidelity track in return. HummingLM’s architecture is designed for both efficiency and expressiveness. By using discrete token representations, the model achieves faster convergence and reduced computational overhead compared to traditional waveform-based approaches. This makes it possible to train on diverse, large-scale datasets and adapt quickly to new genres or user preferences.
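
To make this fusion concrete, here is a minimal PyTorch sketch of one way discrete humming tokens and an instrument control token can be merged into a single conditioning sequence for a transformer decoder. This is an illustration under stated assumptions, not Splash Music's implementation: the class name, vocabulary sizes, and model dimensions are all hypothetical.

```python
import torch
import torch.nn as nn

class HummingFusionSketch(nn.Module):
    """Illustrative only: fuses humming tokens with an instrument control
    token and predicts audio codec tokens with a transformer decoder."""

    def __init__(self, hum_vocab=1024, audio_vocab=1024,
                 n_instruments=64, d_model=512):
        super().__init__()
        self.hum_embed = nn.Embedding(hum_vocab, d_model)        # melodic intent
        self.instr_embed = nn.Embedding(n_instruments, d_model)  # timbre/style cue
        self.audio_embed = nn.Embedding(audio_vocab, d_model)    # target tokens
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, audio_vocab)

    def forward(self, hum_tokens, instr_id, audio_tokens):
        # Conditioning: instrument token prepended to the hummed melody.
        cond = torch.cat([self.instr_embed(instr_id).unsqueeze(1),
                          self.hum_embed(hum_tokens)], dim=1)
        tgt = self.audio_embed(audio_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        return self.head(self.decoder(tgt, cond, tgt_mask=mask))

# Toy run: 2 clips, 50 humming tokens each, 100 target audio tokens.
model = HummingFusionSketch()
logits = model(torch.randint(0, 1024, (2, 50)),   # humming tokens
               torch.randint(0, 64, (2,)),        # instrument preset id
               torch.randint(0, 1024, (2, 100)))  # audio tokens so far
print(logits.shape)  # torch.Size([2, 100, 1024])
```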

The following diagram illustrates how HummingLM is trained and the inference process to generate high-quality music:

Solution overview: Accelerating model development with AWS Trainium on Amazon SageMaker HyperPod

Splash Music collaborated with the GenAIIC to advance its HummingLM foundation model, using the combined capabilities of Amazon SageMaker HyperPod and AWS Trainium chips for model training.

Splash Music's architecture follows SageMaker HyperPod best practices, using Amazon Elastic Kubernetes Service (Amazon EKS) as the orchestrator, Amazon FSx for Lustre to store over 2 PB of data, and AWS Trainium EC2 instances for acceleration. The following diagram illustrates the solution architecture.

In the following sections, we walk through each step of the model development lifecycle, from dataset preparation to compilation for optimized inference.

Dataset preparation

Efficient preparation and processing of large-scale audio datasets is critical for developing controllable music generation models.

In addition, the solution uses an advanced stem separation system that isolates songs into six distinct audio stems: drums, bass, vocals, lead, chordal, and other instruments.
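
The following sketch illustrates the general shape of such a preparation pipeline, assuming torchaudio for I/O and resampling. The separate_stems function is a hypothetical placeholder for the proprietary separator, and the 44.1 kHz target sample rate is an assumption.

```python
import torchaudio
import torchaudio.functional as F

# Stem names from the post; the target sample rate is an assumption.
STEMS = ["drums", "bass", "vocals", "lead", "chordal", "other"]
TARGET_SR = 44100

def separate_stems(waveform, sample_rate):
    # Placeholder for the proprietary separation model. Here we simply
    # return the full mix once per stem so the pipeline runs end to end.
    return [waveform for _ in STEMS]

def prepare_track(path: str, out_dir: str) -> None:
    waveform, sr = torchaudio.load(path)
    if sr != TARGET_SR:
        # Normalize every recording to one sample rate before training.
        waveform = F.resample(waveform, sr, TARGET_SR)
    for name, stem in zip(STEMS, separate_stems(waveform, TARGET_SR)):
        torchaudio.save(f"{out_dir}/{name}.wav", stem, TARGET_SR)
```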

By streamlining data handling from the outset, we make sure that the subsequent model training stages have access to clean, well-structured features.

Model architecture and optimization

HummingLM employs a dual-component architecture: a transformer-based LLM paired with a non-autoregressive upsampling component.

This division of labor is key to HummingLM's effectiveness: the LLM captures high-level musical intent, and the upsampling component handles acoustic details. Splash collaborated with the GenAIIC on research to optimize HummingLM along three dimensions:

- Flexible control signal design – The model accepts control signals of varying durations (1-5 seconds), a significant improvement over fixed-window approaches.
- Zero-shot capability – Unlike systems requiring explicit timbre embedding learning, HummingLM can generalize to unseen instrument presets without additional training.
- Non-autoregressive generation – The upsampling component uses parallel token prediction for significantly faster inference compared to traditional autoregressive approaches (see the sketch after this list).
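
The sketch below illustrates the non-autoregressive idea under stated assumptions: given the first-codebook tokens produced by the LLM, a bidirectional transformer predicts all remaining residual codebooks for every frame in a single parallel pass. The vocabulary size, codebook count, and class name are hypothetical.

```python
import torch
import torch.nn as nn

class ParallelUpsamplerSketch(nn.Module):
    """Non-autoregressive upsampling: given coarse (first-codebook) tokens,
    predict all remaining residual codebooks for every frame in one pass,
    rather than one token at a time."""

    def __init__(self, vocab=1024, n_codebooks=8, d_model=512):
        super().__init__()
        self.n_rest = n_codebooks - 1
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # no causal mask
        self.heads = nn.Linear(d_model, self.n_rest * vocab)

    def forward(self, coarse_tokens):
        h = self.encoder(self.embed(coarse_tokens))  # single parallel pass
        logits = self.heads(h)                       # (B, T, n_rest * vocab)
        B, T, _ = logits.shape
        return logits.view(B, T, self.n_rest, -1)    # per-codebook logits

coarse = torch.randint(0, 1024, (2, 200))  # first-codebook tokens from the LLM
print(ParallelUpsamplerSketch()(coarse).shape)  # torch.Size([2, 200, 7, 1024])
```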

Our evaluation demonstrated HummingLM's superior first codebook prediction capabilities – a critical factor in residual quantization systems, where the first codebook contains most of the acoustic information. The model consistently outperformed baseline approaches such as VALL-E across multiple quality metrics. The evaluation revealed several important findings.

Overall, HummingLM achieves state-of-the-art controllable music generation by significantly improving signal fidelity, generalizing well to unseen instruments, and delivering strong performance across diverse musical styles, boosted further by effective data augmentation strategies.
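
For readers unfamiliar with residual quantization, here is a minimal sketch of its structure, which shows why the first codebook matters most: stage 0 quantizes the signal itself, and each later stage only quantizes the leftover error. The random codebooks below are placeholders; real systems learn them.

```python
import torch

def rvq_encode(x, codebooks):
    """Residual VQ: stage 0 quantizes the signal itself (hence the first
    codebook is the most informative); each later stage only quantizes
    the error left over by the stages before it."""
    residual, codes = x, []
    for cb in codebooks:                                # cb: (K, dim)
        idx = torch.cdist(residual, cb).argmin(dim=-1)  # nearest codeword
        codes.append(idx)
        residual = residual - cb[idx]                   # pass on the leftover
    return codes

def rvq_decode(codes, codebooks, n_stages):
    # Summing only the first n_stages terms: codebook 0 alone already
    # yields the coarse reconstruction; later stages add refinement.
    return sum(cb[idx] for cb, idx in list(zip(codebooks, codes))[:n_stages])

frames = torch.randn(100, 32)                     # toy frame embeddings
books = [torch.randn(256, 32) for _ in range(4)]  # placeholder codebooks
codes = rvq_encode(frames, books)
coarse = rvq_decode(codes, books, n_stages=1)     # first codebook only
full = rvq_decode(codes, books, n_stages=4)       # all stages
```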

Efficient distributed training through parallelism, memory, and AWS Neuron optimization

Splash Music compiled and optimized its model for AWS Neuron, accelerating its model development lifecycle and deployment on AWS Trainium chips. The team designed for scalability, parallelization, and memory efficiency, building a system that supports models scaling from 2B to over 10B parameters.
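
As a simplified illustration of the sizing involved, the sketch below computes the data-parallel degree that remains once tensor- and pipeline-parallel degrees are fixed for a given model size. All degrees and core counts here are hypothetical, not Splash Music's actual configuration.

```python
# Hypothetical sizing helper: given the total number of accelerator
# cores, derive the data-parallel degree once tensor- and
# pipeline-parallel degrees are fixed for a model size.
CONFIGS = {
    "2B":  {"tensor": 8,  "pipeline": 1},   # illustrative degrees
    "10B": {"tensor": 32, "pipeline": 2},
}

def data_parallel_degree(total_cores: int, model: str) -> int:
    cfg = CONFIGS[model]
    model_parallel = cfg["tensor"] * cfg["pipeline"]  # cores per replica
    if total_cores % model_parallel:
        raise ValueError("cores must divide evenly across replicas")
    return total_cores // model_parallel

for size in CONFIGS:
    print(size, "->", data_parallel_degree(512, size), "data-parallel replicas")
```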

With optimizations at the Neuron level complete, the team turned to the orchestration layer. On SageMaker HyperPod, Splash Music developed a robust, Slurm-integrated pipeline that streamlines multi-node training, balances parallelism, and uses activation checkpointing for superior memory efficiency. The pipeline processes data through several critical stages.
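
Activation checkpointing, the memory technique named above, is straightforward to illustrate with PyTorch's built-in torch.utils.checkpoint: activations inside each wrapped block are discarded during the forward pass and recomputed during backward, trading compute for memory. The layer sizes in this sketch are arbitrary.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    """Each block's activations are dropped in the forward pass and
    recomputed during backward, cutting peak memory at some compute cost."""

    def __init__(self, n_layers: int = 24, d_model: int = 512):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # use_reentrant=False is the recommended non-reentrant path
            x = checkpoint(block, x, use_reentrant=False)
        return x

x = torch.randn(8, 512, requires_grad=True)
CheckpointedStack()(x).sum().backward()  # blocks are re-run here to get grads
```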

Model inference on AWS Inferentia with Amazon Elastic Container Service (Amazon ECS)

After training, the model is deployed on an Amazon Elastic Container Service (Amazon ECS) cluster backed by AWS Inferentia instances. User-submitted recordings, which often vary in quality, are uploaded to Amazon Simple Storage Service (Amazon S3), which absorbs the high volume of uploads. Each upload triggers an AWS Lambda function that queues the file in Amazon Simple Queue Service (Amazon SQS) for delivery to the ECS cluster where inference runs. On the cluster, HummingLM performs two key steps: stem separation to isolate and clean the vocals, and audio-to-melody conversion to extract the musical structure. Finally, a post-processing step recombines the cleaned vocals with backing tracks, producing the fully processed remixed audio.
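
A Lambda function of roughly this shape could implement the upload-to-queue step. This is a hypothetical sketch, not Splash Music's code; the QUEUE_URL environment variable and the message fields are assumed configuration details, while the event structure is the standard S3 notification format.

```python
import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["QUEUE_URL"]  # assumed configuration of the function

def handler(event, context):
    """Sketch of the S3-triggered step: enqueue one SQS job per new
    recording for the ECS inference service to consume."""
    records = event["Records"]  # standard S3 event notification shape
    for record in records:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({
                "bucket": record["s3"]["bucket"]["name"],
                "key": record["s3"]["object"]["key"],
            }),
        )
    return {"queued": len(records)}
```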

Results and impact

Splash Music's research and development teams now rely on a unified infrastructure built on Amazon SageMaker HyperPod and AWS Trainium chips. The solution has yielded significant benefits.

Splash achieved substantial throughput improvements over conventional architectures, enabling it to process expansive datasets and support the model's complex multimodal nature. The solution provides a robust foundation for future growth as data and models continue to scale.

“AWS Trainium and SageMaker HyperPod took the friction out of our workflow at Splash Music,” says Daniel Hatadi, Software Engineer, Splash Music. “We replaced brittle GPU clusters with automated, self-healing distributed training that scales seamlessly. Training times are nearly 50% faster, and training costs have dropped by 54%. By relying on AWS AI chips and SageMaker HyperPod and collaborating with the AWS Generative AI Innovation Center, we were able to focus on model design and music-specific research, instead of cluster maintenance. This collaboration has made it easier for us to iterate quickly, run more experiments, train larger models, and keep shipping improvements without needing a bigger team.”

Splash Music was also featured in the AWS Summit Sydney 2025 keynote.

Conclusion and next steps

Splash Music is redefining how creators bring their musical ideas to life, making it possible for anyone to generate fresh, personalized tracks that resonate with millions of listeners worldwide. To support this vision at scale, Splash built its HummingLM FM in close collaboration with AWS Startups and the GenAIIC, using services such as SageMaker HyperPod and AWS Trainium. These solutions provide the infrastructure and performance needed to keep pace, helping Splash to create even more intuitive and inspiring experiences for creators.

“With SageMaker HyperPod and Trainium, our researchers experiment as fast as our community creates,” says Randeep Bhatia, Chief Technology Officer, Splash Music. “We’re not just keeping up with music trends—we’re setting them.”

Looking forward, Splash Music plans to expand its training datasets tenfold, explore multimodal audio/video generation, and continue collaborating with the GenAIIC on further R&D and the next version of the HummingLM FM.

Try creating your own music using Splash Music, and learn more about Amazon SageMaker HyperPod and AWS Trainium.


About the authors

Sheldon Liu is a Senior Applied Scientist and ANZ Tech Lead at the AWS Generative AI Innovation Center. He partners with AWS customers across diverse industries to develop and implement innovative generative AI solutions, accelerating their AI adoption journey while driving significant business outcomes.

Mahsa Paknezhad is a Deep Learning Architect and a key member of the AWS Generative AI Innovation Center. She works closely with enterprise clients to design, implement, and optimize cutting-edge generative AI solutions. With a focus on scalability and production readiness, Mahsa helps organizations across diverse industries harness advanced generative AI models to achieve meaningful business outcomes.

Xiaoning Wang is a machine learning engineer at the AWS Generative AI Innovation Center. He specializes in large language model training and optimization on AWS Trainium and Inferentia, with experience in distributed training, RAG, and low-latency inference. He works with enterprise customers to build scalable generative AI solutions that drive real business impact.

Tianyu Liu is an applied scientist at the AWS Generative AI Innovation Center. He partners with enterprise customers to design, implement, and optimize cutting-edge generative AI models, advancing innovation and helping organizations achieve transformative results with scalable, production-ready AI solutions.

Xuefeng Liu leads a science team at the AWS Generative AI Innovation Center in the Asia Pacific regions. His team partners with AWS customers on generative AI projects, with the goal of accelerating customers’ adoption of generative AI.

Daniel Wirjo is a Solutions Architect at AWS, focused on AI and SaaS startups. As a former startup CTO, he enjoys collaborating with founders and engineering leaders to drive growth and innovation on AWS. Outside of work, Daniel enjoys taking walks with a coffee in hand, appreciating nature, and learning new ideas.
