MarkTechPost@AI, September 13
Google Releases VaultGemma: The First Large-Scale Open Model Pretrained with Differential Privacy


Google AI Research and DeepMind have jointly released VaultGemma 1B, the largest open large language model trained entirely with differential privacy (DP). The release marks an important step toward building AI models that are both capable and privacy-preserving. VaultGemma 1B adopts the Gemma architecture, optimized for private training, with 1B parameters and a 26-layer Transformer. The model was pretrained on a massive dataset with differential privacy applied throughout, rather than only at the fine-tuning stage, ensuring privacy protection at the foundational level. Although VaultGemma's utility on academic benchmarks falls short of non-private models, it achieves a formal mathematical guarantee of differential privacy, and memorization tests detected no leakage of training data, laying a solid foundation for safer, more transparent, privacy-preserving AI models.

🔒 **A milestone for differentially private pretraining**: VaultGemma 1B is the first large-scale open language model trained from scratch entirely with differential privacy (DP). Unlike approaches that apply DP only during fine-tuning, VaultGemma enforces strict privacy protection from the foundational training stage onward, setting a new benchmark for building safer AI models.

🧠 **Model architecture and optimizations**: Built on the Gemma architecture, the model has 1 billion parameters in a 26-layer decoder-only Transformer. To accommodate the compute cost and batch-size requirements of DP training, the sequence length was shortened to 1024 tokens, and optimizations such as Multi-Query Attention (MQA) were adopted.

🛡️ **Strict data-privacy guarantees**: VaultGemma was trained on a 13-trillion-token dataset of web documents, code, and scientific articles. Before training, the dataset went through multiple rounds of strict filtering to remove unsafe or sensitive content and to minimize exposure of personal information, ensuring the model's safety and fairness. Via DP-SGD, the model carries a formal sequence-level DP guarantee (ε ≤ 2.0, δ ≤ 1.1e−10).

📈 **Novel scaling laws for private training**: To train large models efficiently under DP constraints, the team developed DP-specific scaling laws, combining optimal learning-rate modeling, parametric extrapolation of loss values, and semi-parametric fits to predict achievable performance precisely and use training resources efficiently.

⚖️ **The utility-privacy trade-off**: Although VaultGemma trails non-private models on academic benchmarks, its key advantage is a formal mathematical guarantee of differential privacy, and memorization tests detected no leakage of training data, demonstrating that a workable balance between strong AI capability and user privacy is achievable.

Google AI Research and DeepMind have released VaultGemma 1B, the largest open-weight large language model trained entirely with differential privacy (DP). This development is a major step toward building AI models that are both powerful and privacy-preserving.

Why Do We Need Differential Privacy in LLMs?

Large language models trained on vast web-scale datasets are prone to memorization attacks, where sensitive or personally identifiable information can be extracted from the model. Studies have shown that verbatim training data can resurface, especially in open-weight releases.

Differential Privacy offers a mathematical guarantee that prevents any single training example from significantly influencing the model. Unlike approaches that apply DP only during fine-tuning, VaultGemma enforces full private pretraining, ensuring that privacy protection begins at the foundational level.
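Formally, a randomized training mechanism M is (ε, δ)-differentially private if, for any two datasets D and D′ differing in a single training example (here, a single 1024-token sequence) and any set of outcomes S:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
```

Smaller ε and δ mean any one sequence has less influence on the trained model; VaultGemma's reported guarantee is ε ≤ 2.0 with δ ≤ 1.1e−10.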

https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf

What Is the Architecture of VaultGemma?

VaultGemma is architecturally similar to earlier Gemma models, but optimized for private training.

A notable change is the reduction of sequence length to 1024 tokens, which lowers compute costs and enables larger batch sizes under DP constraints.
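Why shorter sequences and larger batches help can be seen from the arithmetic of DP-SGD: noise with standard deviation proportional to the clipping bound is added to the *sum* of clipped per-example gradients, so after averaging over a batch of size B the effective noise shrinks as 1/B. A minimal sketch (illustrative numbers, not VaultGemma's actual configuration):

```python
# In DP-SGD, Gaussian noise with std sigma * C is added to the sum of clipped
# per-example gradients, then the sum is divided by the batch size B. The
# noise std on the *averaged* gradient is therefore sigma * C / B:
# quadrupling B cuts the effective noise by 4x at the same privacy cost
# per step, which is why DP training favors very large batches.

def effective_noise_std(sigma: float, clip_norm: float, batch_size: int) -> float:
    """Std of the DP noise on the averaged batch gradient."""
    return sigma * clip_norm / batch_size

small_batch = effective_noise_std(sigma=1.0, clip_norm=1.0, batch_size=1024)
large_batch = effective_noise_std(sigma=1.0, clip_norm=1.0, batch_size=4096)
assert large_batch < small_batch  # larger batch => less noise per update
```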

What Data Was Used for Training?

VaultGemma was trained on the same 13 trillion-token dataset as Gemma 2, composed primarily of English text from web documents, code, and scientific articles.

The dataset underwent several filtering stages to remove unsafe or sensitive content and to minimize exposure of personal information, ensuring both safety and fairness in benchmarking.

How Was Differential Privacy Applied?

VaultGemma used DP-SGD (Differentially Private Stochastic Gradient Descent) with per-example gradient clipping and Gaussian noise addition. The implementation was built on JAX Privacy and introduced optimizations for scalability.

The model achieved a formal DP guarantee of (ε ≤ 2.0, δ ≤ 1.1e−10) at the sequence level (1024 tokens).
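The core DP-SGD recipe the article describes can be sketched in a few lines of NumPy: clip each example's gradient, sum, add calibrated Gaussian noise, and average. This is an illustrative sketch with placeholder hyperparameters, not VaultGemma's production implementation:

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One DP-SGD update direction from per-example gradients of shape (batch, dim)."""
    rng = rng or np.random.default_rng(0)
    # 1. Clip each example's gradient to L2 norm <= clip_norm, bounding
    #    any single example's influence on the update.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # 2. Sum the clipped gradients and add Gaussian noise whose std is
    #    calibrated to the clipping bound (sigma * C).
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    # 3. Average over the batch to get the (privatized) update direction.
    return noisy_sum / per_example_grads.shape[0]

grads = np.random.default_rng(1).normal(size=(8, 4))
update = dp_sgd_step(grads)
assert update.shape == (4,)
```

The privacy accounting that turns the noise multiplier and sampling rate into the reported (ε, δ) is done separately by a DP accountant.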

How Do Scaling Laws Work for Private Training?

Training large models under DP constraints requires new scaling strategies. The VaultGemma team developed DP-specific scaling laws with three innovations:

1. Optimal learning rate modeling using quadratic fits across training runs.
2. Parametric extrapolation of loss values to reduce reliance on intermediate checkpoints.
3. Semi-parametric fits to generalize across model size, training steps, and noise-batch ratios.

This methodology enabled precise prediction of achievable loss and efficient resource use on the TPUv6e training cluster.
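The first innovation, quadratic fitting for the optimal learning rate, can be sketched simply: fit a parabola to loss versus log-learning-rate over a handful of runs and take its vertex. The (lr, loss) pairs below are made up for illustration and are not VaultGemma's measurements:

```python
import numpy as np

# Hypothetical sweep: final loss at five learning rates.
lrs = np.array([1e-4, 3e-4, 1e-3, 3e-3, 1e-2])
losses = np.array([3.10, 2.95, 2.88, 2.93, 3.20])

# Fit loss ~ a * x^2 + b * x + c in x = log10(lr); the vertex of the
# parabola estimates the best learning rate without running a dense sweep.
x = np.log10(lrs)
a, b, c = np.polyfit(x, losses, deg=2)
best_lr = 10 ** (-b / (2 * a))  # vertex at x* = -b / (2a)

assert a > 0  # convex fit: the sweep brackets a minimum
```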

What Were the Training Configurations?

VaultGemma was trained on 2048 TPUv6e chips using GSPMD partitioning and MegaScale XLA compilation.

The achieved loss was within 1% of predictions from the DP scaling law, validating the approach.

How Does VaultGemma Perform Compared to Non-Private Models?

On academic benchmarks, VaultGemma trails its non-private counterparts but still shows strong utility.

These results suggest that DP-trained models are currently comparable to non-private models from about five years ago. Importantly, memorization tests confirmed that no training data leakage was detectable in VaultGemma, unlike in non-private Gemma models.
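A verbatim-memorization test of the kind referenced here typically prompts the model with a prefix taken from a training document and checks whether greedy decoding reproduces the true continuation. A minimal sketch, where `generate` stands in for any LLM decoding API (a hypothetical interface, not VaultGemma's):

```python
def is_memorized(generate, prefix_tokens, true_suffix_tokens):
    """True if greedy decoding from the prefix reproduces the training suffix verbatim."""
    continuation = generate(prefix_tokens, max_new_tokens=len(true_suffix_tokens))
    return continuation[:len(true_suffix_tokens)] == true_suffix_tokens

# Toy stand-in "model" that always emits zeros, to exercise the check.
fake_generate = lambda prefix, max_new_tokens: [0] * max_new_tokens

assert not is_memorized(fake_generate, [1, 2, 3], [4, 5, 6])
assert is_memorized(fake_generate, [1, 2, 3], [0, 0, 0])
```

Running such probes over many sampled training prefixes and finding no verbatim matches is what "no detectable leakage" means operationally.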


Summary

In summary, VaultGemma 1B proves that large-scale language models can be trained with rigorous differential privacy guarantees without making them impractical to use. While a utility gap remains compared to non-private counterparts, the release of both the model and its training methodology provides the community with a strong foundation for advancing private AI. This work signals a shift toward building models that are not only capable but also inherently safe, transparent, and privacy-preserving.



The post Google AI Releases VaultGemma: The Largest and Most Capable Open Model (1B-parameters) Trained from Scratch with Differential Privacy appeared first on MarkTechPost.
