Character.ai Releases the Kaiju LLM Family, with a Focus on Efficiency and Safety

 

The Character.ai team has announced Kaiju, its in-house family of large language models (LLMs) designed for speed, engagement, and safety. The series comes in three sizes (13B, 34B, 110B) and combines a dense Transformer architecture with efficiency optimizations such as int8 quantization, multi-query attention (MQA), sliding-window attention, and cross-layer KV sharing. These innovations aim to substantially reduce inference cost and memory footprint while preserving model quality and conversational ability. During training and deployment, the Kaiju models also rely on techniques such as quantization-aware training (QAT) and gradient compression, and they go through multi-stage safety alignment that includes supervised fine-tuning and reinforcement learning from user feedback.

🌟 **The Kaiju LLM family emphasizes efficiency and safety**: Character.ai's Kaiju LLMs come in three sizes (13B, 34B, and 110B) with the core goal of fast, engaging, and safe AI interaction. Techniques such as int8 quantization, multi-query attention (MQA), and sliding-window attention deliver significant gains in inference efficiency and cost-effectiveness, enabling large-scale deployment.

💡 **Innovative architecture design optimizes inference performance**: The Kaiju models introduce several key architectural innovations. Multi-query attention (MQA) and cross-layer KV sharing shrink the KV cache, markedly improving inference efficiency, particularly in conversational settings. Sliding-window attention reduces the compute needed for long contexts, and combined with global attention it preserved long-context retrieval quality in internal tests, maintaining coherence and information capture.

🚀 **Quantization-aware training and gradient compression improve training efficiency**: To balance model quality against training cost, the Kaiju models use quantization-aware training (QAT), retaining near-bf16 accuracy at int8 precision while training 20-30% faster. In addition, 6-bit low-bit gradient communication via the Squinch algorithm further reduces communication overhead, forming a scalable learning system.

🛡️ **Multi-stage safety alignment for responsible AI**: Before deployment, the Kaiju series goes through a rigorous multi-stage safety and alignment process, including supervised fine-tuning (SFT) on high-quality (safety-related, instruction-following) data and reinforcement learning (modified online DPO) on user feedback data and ratings. Some Kaiju models also ship with an optional classifier head that outputs token-level metrics about input safety and supports classifier-guided beam search to strengthen safety at inference time.

As the Character.ai team shifts towards building on top of open-source models, we wanted to share the work that went into some of our OG research. After all, our founder Noam Shazeer co-invented the Transformer!

Kaiju is Character.ai's in-house family of LLMs, built to be fast and engaging, with an eye towards safety.

Available in three sizes, Kaiju combines a dense transformer architecture with aggressive efficiency optimizations, including int8 quantization, multi-query attention, sliding-window attention, and cross-layer cache sharing. Previous blog posts mention some of these (and more): https://blog.character.ai/optimizing-ai-inference-at-character-ai/ and https://blog.character.ai/optimizing-ai-inference-at-character-ai-part-deux-2/.

If you're an engineer interested in building the next generation of Character.ai models and this work sounds interesting, check out our open roles.

Model Overview

The Kaiju family of models comes in 3 production variants: Small (13B), Medium (34B), and Large (110B).

The Kaiju models are heavily optimized for engaging conversation and serving efficiency; those priorities, rather than academic benchmarks, drove the design philosophy.

Architecture Innovations

All Kaiju models are dense, transformer-based autoregressive LLMs with several unique architectural components.

Multiquery Attention (MQA)

Kaiju relies heavily on MQA to reduce the per-token KV cache size and improve our inference efficiency. Chat inference workloads typically depend heavily on KV cache hit rate, since input tokens are largely shared from one turn to the next, and a smaller per-token KV cache dramatically improves that performance.

MQA is known to have a measurable negative impact on some AGI benchmarks like MMLU - this is both publicly documented and reproduced internally by our team. Since we are not optimizing for AGI, we found the inference efficiency gains well worth the small quality impact.
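To make the mechanism concrete, here is a minimal MQA sketch in PyTorch (module name and dimensions are illustrative, not Character.ai's production code). All query heads share a single key/value head, so the per-token KV cache holds one K and one V vector per layer instead of one per head:

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiQueryAttention(nn.Module):
    """Minimal MQA: many query heads, one shared key/value head."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # K and V are projected to a single head's width and shared by every query head,
        # shrinking the per-token KV cache by a factor of n_heads.
        self.k_proj = nn.Linear(d_model, self.head_dim)
        self.v_proj = nn.Linear(d_model, self.head_dim)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        # Single K/V head, expanded (not copied) across query heads for the attention op.
        k = self.k_proj(x).unsqueeze(1).expand(-1, self.n_heads, -1, -1)
        v = self.v_proj(x).unsqueeze(1).expand(-1, self.n_heads, -1, -1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))
```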

Sliding Window Attention

The Kaiju production models utilize sliding window attention, which reduces the flops required for attention, especially in longer context settings.

All Character.ai models interleave sliding window and global attention layers. For current production models, this is done in a roughly 6:1 ratio of sliding to global attention, and the sliding window is 1024 tokens long.

Naive sliding-window attention causes a drop in model quality on long contexts, but in internal experiments with interleaved sliding-window attention there was little to no drop in “needle in the haystack” long-context retrieval quality.

It’s also worth noting that our current sliding window attention does not implement attention sinks.
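As a rough sketch of the scheme described above (window length and interleave ratio taken from this post; the helper names are hypothetical), sliding-window attention amounts to a banded causal mask, interleaved with occasional full global-attention layers:

```python
import torch

def causal_mask(t: int, window: int | None = None) -> torch.Tensor:
    """Boolean attention mask: True = query may attend to key.

    window=None  -> full (global) causal attention
    window=1024  -> causal sliding-window attention over the most recent 1024 tokens
    Real kernels fuse this into the attention computation; this is illustrative only.
    """
    i = torch.arange(t).unsqueeze(1)   # query positions
    j = torch.arange(t).unsqueeze(0)   # key positions
    mask = j <= i                      # causal constraint
    if window is not None:
        mask &= (i - j) < window       # keep only the last `window` keys
    return mask

def layer_schedule(n_layers: int, ratio: int = 6, window: int = 1024) -> list[str]:
    """Roughly 6:1 interleave: six sliding-window layers for every global layer."""
    return ["global" if layer % (ratio + 1) == ratio else f"sliding({window})"
            for layer in range(n_layers)]

print(layer_schedule(14))   # six sliding layers, one global layer, repeated
```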

Cross Layer KV Sharing

In addition to MQA, Kaiju models share KV cache between adjacent layers with the same attention mechanism. Similar to MQA, this allows for a decrease in the KV cache size required for inference and does not lead to a measurable drop in model accuracy. Generally, 2-3 layers share a KV cache.
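A toy sketch of the idea, assuming hypothetical groups of two adjacent layers (the post says 2-3 in practice): the first layer of each group computes K/V, and the following layer reuses the same cache entries instead of computing and storing its own:

```python
import torch
from torch import nn

class SharedKVStack(nn.Module):
    """Toy decoder stack in which adjacent layers share one KV projection and cache.

    Dimensions, group size, and the omission of causal masking / MLPs are all
    simplifications for illustration.
    """
    def __init__(self, n_layers: int = 4, d_model: int = 256, group_size: int = 2):
        super().__init__()
        self.group_size = group_size
        self.q_projs = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
        n_groups = (n_layers + group_size - 1) // group_size
        # One K/V projection per group of layers instead of one per layer.
        self.kv_projs = nn.ModuleList(nn.Linear(d_model, 2 * d_model) for _ in range(n_groups))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        kv_cache: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}
        for layer_idx, q_proj in enumerate(self.q_projs):
            group = layer_idx // self.group_size
            if group not in kv_cache:                       # first layer of the group
                k, v = self.kv_projs[group](x).chunk(2, dim=-1)
                kv_cache[group] = (k, v)                    # later layers in the group reuse this
            k, v = kv_cache[group]
            q = q_proj(x)
            attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
            x = x + attn @ v                                # residual update (toy, no MLP)
        return x
```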

Int8

The current family of Kaiju models stores parameters and KV cache values in int8. At inference time, matrix multiplications are done in int8. On most modern accelerators, int8 matrix multiplication delivers roughly 2x the throughput of bf16.

Note: Kaiju models are all currently trained via Quantization Aware Training. Using QAT allows the models to maintain bf16-level accuracy while training 20-30% faster.
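A minimal sketch of the general QAT idea, not Character.ai's actual recipe: round weights to the int8 grid on the fly during the forward pass, and use a straight-through estimator so gradients still flow to the full-precision master weights:

```python
import torch

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor int8 'fake quantization' with a straight-through estimator.

    The forward pass sees int8-rounded weights (as at serving time); the backward
    pass treats the rounding as the identity so the full-precision weights still learn.
    """
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    w_q = torch.clamp(torch.round(w / scale), -127, 127) * scale
    return w + (w_q - w).detach()   # forward: w_q, backward: gradient of w

# Usage: quantize weights on the fly inside a layer's forward pass.
w = torch.randn(512, 512, requires_grad=True)
x = torch.randn(8, 512)
y = x @ fake_quant_int8(w).T
y.sum().backward()                  # gradients reach `w` despite the rounding
```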

Additional Innovations

Pre-layer norms - Kaiju models use pre-layer normalization: RMSNorm is applied to the input of each layer, before the layer's main matrix multiplications, rather than after the layer's computations.
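A short sketch of a pre-LN block (nn.RMSNorm requires PyTorch 2.4 or newer; the dimensions and use of nn.MultiheadAttention are illustrative assumptions):

```python
import torch
from torch import nn

class PreNormBlock(nn.Module):
    """Pre-LN transformer block sketch: RMSNorm runs on each sub-layer's input,
    before its matrix multiplications, rather than after them."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn_norm = nn.RMSNorm(d_model)
        self.mlp_norm = nn.RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.attn_norm(x)                        # normalize *before* attention
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.mlp_norm(x))           # normalize *before* the MLP
        return x
```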

Dynamic Clamping - Dynamic clamping of activations helps ensure stability during training. The model “learns” to utilize the clamping, so the same clamping must also be applied at inference time.
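The post does not spell out the clamping rule; one plausible reading of "dynamic" clamping is a bound derived from the activations' own statistics rather than a fixed constant. A hedged sketch:

```python
import torch

def dynamic_clamp(x: torch.Tensor, k: float = 6.0) -> torch.Tensor:
    """Clamp activations to +/- k standard deviations of the current tensor.

    Hypothetical rule chosen for illustration. Because the model learns to rely on
    the clamp during training, the same operation must also run at inference time.
    """
    bound = (k * x.std()).detach()
    return torch.clamp(x, min=-bound, max=bound)
```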

Model Training

Beyond architectural efficiency, Kaiju’s performance depends heavily on its training stack. Quantization-aware training, low-bit gradient communication, and stability enhancements together form the foundation of Kaiju’s scalable learning system.

Kaiju models were trained entirely on H100 GPUs in GCP clusters using model parallelism: tensor + sequence parallelism within nodes and FSDP across nodes.

Quantization Aware Training

Kaiju models are trained using a variety of precisions to balance model quality and training cost.

Int8 - Forward pass weights, KV
Bf16 - Activations, local gradients
Fp32 - Gradient accumulations, FSDP master weights

Gradient communication is done in 6 bits using Squinch.

Gradient Compression (Squinch)

Squinch is a novel blockwise gradient compression algorithm that seeks to minimize the expected log-error of gradient reconstruction. Each block contains 8 elements, and the distribution of gradient magnitudes is modeled as log-uniform over a finite domain.
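Squinch itself isn't published in detail; the following is a hedged sketch consistent with the description above (8-element blocks, a log-uniform magnitude model, 6-bit codes) rather than the actual algorithm. The `dynamic_range` parameter and the exact code layout are assumptions:

```python
import numpy as np

def squinch_like_compress(grad, bits: int = 6, block: int = 8, dynamic_range: float = 1e-4):
    """Blockwise low-bit gradient compression sketch.

    Per 8-element block we keep a float scale (the block's max magnitude) and, per
    element, a sign plus a (bits-1)-bit code placing |g|/scale on a log-uniform grid
    over [dynamic_range, 1]. Assumes len(grad) is a multiple of `block`.
    """
    g = grad.reshape(-1, block)
    scale = np.abs(g).max(axis=1, keepdims=True) + 1e-30
    levels = 2 ** (bits - 1) - 1
    rel = np.clip(np.abs(g) / scale, dynamic_range, 1.0)
    pos = np.log(rel / dynamic_range) / -np.log(dynamic_range)   # position in [0, 1]
    return np.sign(g).astype(np.int8), np.rint(pos * levels).astype(np.int8), scale

def squinch_like_decompress(sign, code, scale, bits: int = 6, dynamic_range: float = 1e-4):
    levels = 2 ** (bits - 1) - 1
    rel = dynamic_range ** (1.0 - code / levels)                 # invert the log-uniform grid
    return (sign * rel * scale).reshape(-1)

g = np.random.randn(1024).astype(np.float32) * 1e-3
sign, code, scale = squinch_like_compress(g)
g_hat = squinch_like_decompress(sign, code, scale)
print(np.abs(g_hat - g).max(), np.abs(g).max())   # reconstruction error vs. gradient scale
```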

Additional Efficiency Innovations

Virtual scalars (Bungee) - In order to stabilize int8 training, virtual scalars are introduced to allow the model to express a wider range of activations and gradients. This is mostly helpful for smaller models.
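The post does not define how Bungee's virtual scalars work; one speculative reading is a full-precision scalar carried alongside a tensor stored on a fixed int8 grid, so the effective values can reach well outside that grid. A sketch of that interpretation (the fixed step size and function names are invented for illustration):

```python
import torch

FIXED_STEP = 0.01   # hypothetical fixed int8 step size; representable range ~[-1.27, 1.27]

def quant_fixed_int8(x: torch.Tensor) -> torch.Tensor:
    """Round to a fixed int8 grid (illustrative stand-in for an int8-stored tensor)."""
    return torch.clamp(torch.round(x / FIXED_STEP), -127, 127) * FIXED_STEP

def apply_virtual_scalar(x_int8_domain: torch.Tensor, virtual_scale: torch.Tensor) -> torch.Tensor:
    """Effective value = stored int8 tensor * full-precision virtual scalar.

    The tensor itself stays inside the narrow int8 grid, while the scalar lets the
    model express a much wider range of activations and gradients.
    """
    return quant_fixed_int8(x_int8_domain) * virtual_scale
```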

Ternary Weight Updates - When training small int8 models, where the full int8 weights fit on a node, the weights can be pinned to the node, similar to ZeRO-2. When the magnitude of the int8 weight updates is small, each weight's update can be sent as a 0, 1, or -1, compressing the weight broadcast to 1.6 bits/parameter.
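The 1.6 bits/parameter figure follows from packing ternary digits: log2(3) ≈ 1.585 bits per {-1, 0, +1} value, and five such digits fit exactly in one byte (3^5 = 243 ≤ 256). A hedged sketch of one such encoding (the production wire format is not described):

```python
import numpy as np

def encode_ternary(updates: np.ndarray) -> np.ndarray:
    """Pack sign-only weight updates {-1, 0, +1} five at a time into one byte."""
    t = np.sign(updates).astype(np.int64) + 1          # map {-1, 0, 1} -> {0, 1, 2}
    t = np.pad(t, (0, (-len(t)) % 5))                  # pad to a multiple of 5
    digits = t.reshape(-1, 5)
    return (digits * 3 ** np.arange(5)).sum(axis=1).astype(np.uint8)

def decode_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    digits = (packed[:, None].astype(np.int64) // 3 ** np.arange(5)) % 3
    return (digits.reshape(-1)[:n] - 1).astype(np.int8)  # back to {-1, 0, +1}

u = np.array([1, -1, 0, 0, 1, 1, -1])
packed = encode_ternary(u)
print(decode_ternary(packed, len(u)))   # [ 1 -1  0  0  1  1 -1]
print(8 * packed.nbytes / len(u))       # exactly 1.6 bits/param when len(u) is a multiple of 5
```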

Data Strategy

Kaiju models are trained on optimized data mixes. There are two categories of data mix objectives:

    MMLU Max - These data mixes are designed to maximize “AGI Benchmarks”.
    Production Max - These data mixes are designed to create a highly engaging model.

In general, the methodology involves selecting a pre-training data mix that is as similar as possible to the task being optimized for (e.g., measuring similarity via T5 embeddings).
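A hedged sketch of embedding-similarity data selection, assuming a T5 encoder from Hugging Face transformers and mean-pooled embeddings; the threshold, model checkpoint, and example data below are placeholders, not the production pipeline:

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Placeholder data: candidate pre-training documents and examples of the target task.
candidate_documents = ["A long role-play dialogue ...", "An article about tax law ..."]
target_task_examples = ["User: hi! Bot: hey there, how's your day going?"]

tok = T5Tokenizer.from_pretrained("t5-base")
enc = T5EncoderModel.from_pretrained("t5-base").eval()

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pooled, L2-normalized T5 encoder embeddings."""
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    hidden = enc(**batch).last_hidden_state                  # (n, seq, d)
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)            # average over real tokens only
    return torch.nn.functional.normalize(pooled, dim=-1)

task_centroid = embed(target_task_examples).mean(0)          # what "similar to the task" means here
scores = embed(candidate_documents) @ task_centroid          # cosine similarity per document
keep = [doc for doc, s in zip(candidate_documents, scores) if s > 0.3]   # arbitrary threshold
```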

Kaiju models are trained on a broad mix of web-scale text, code, and synthetic data. Each variant uses a slightly different balance depending on its goal - for example, natural, high-engagement conversation requires different inputs than a model trained for benchmark performance.

We perform an annealing process near the end of the pretraining run, scheduling in the MMLU Max mix and other instruction data. This boosts the final performance of the models, as it unlocks instruction following and specific knowledge for benchmark tasks.

Safety and Alignment

Before deployment, Kaiju models undergo a multi-phase safety and alignment process, including:

    Supervised Fine-Tuning on high-quality (safety-related, instruction-following) data
    Reinforcement Learning (modified online DPO) on user swipe data and feedback
    Classifier training

Notably, Kaiju models come with an optional additional classifier head. The classifier head is a linear layer that outputs token-level metrics about the safety of the input along various dimensions.

While the Kaiju models can be used with any traditional sampling method, we implement classifier-guided beam search, where the classifier results are used to augment how we sample tokens at inference time. 
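A hedged sketch of what such a classifier head and a classifier-guided scoring rule could look like; the safety dimensions, the combination rule, and the weight `alpha` are assumptions for illustration, not the documented production method:

```python
import torch
from torch import nn

class SafetyClassifierHead(nn.Module):
    """Optional linear head mapping each token's hidden state to per-dimension safety logits.

    The dimension names here are hypothetical placeholders.
    """
    def __init__(self, d_model: int, dims=("violence", "sexual", "self_harm")):
        super().__init__()
        self.dims = dims
        self.proj = nn.Linear(d_model, len(dims))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)        # (batch, seq, n_safety_dims) token-level logits

def classifier_guided_score(lm_logprob: torch.Tensor, safety_logits: torch.Tensor,
                            alpha: float = 1.0) -> torch.Tensor:
    """One plausible way to fold classifier output into beam scoring: penalize
    candidate continuations by their predicted unsafety before ranking beams."""
    unsafety = torch.sigmoid(safety_logits).max(dim=-1).values   # worst dimension per token
    return lm_logprob - alpha * unsafety
```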

The Future of Safety-centric, Scalable AI

Kaiju demonstrates that production performance—not just benchmark scores—can and should drive architecture choices. Techniques such as int8 QAT, MQA, and KV sharing collectively reduce inference memory and cost by orders of magnitude, enabling large-scale deployment. 

As we shift our focus to OSS LLMs going forward, we'll continue to push towards our goals of efficient deployment, dynamic and engaging conversation, and robust safety and alignment.

Character.ai’s team works across model architecture, safety alignment, and production infrastructure at the cutting edge of interactive AI. If you’re an engineer or researcher who thrives on contributing to large-scale, human-centered ML systems, check out our open job posts HERE. We’d love to have you join our team! 
