A New Method for Dynamically Adjusting Language Model Compute Steps

cs.AI updates on arXiv.org, October 17, 12:13


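This paper proposes a class of supervised training objectives called "Catch Your Breath" (CYB) that lets a language model dynamically adjust the number of compute steps it spends on each input token. The model can request extra compute when it needs it, and each granted request inserts a special "pause" token at the next input step. Three CYB loss variants are studied: CYB-AP, CYB-VA, and CYB-DP. In fine-tuning experiments, the CYB model needs only one third as much training data as the baseline model to reach the same performance, and it allocates compute according to token complexity and context, for example pausing often after plural nouns and varying widely on ambiguous tokens.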

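💡 **Dynamic compute-step adjustment:** The paper proposes a training objective called "Catch Your Breath" (CYB) that lets a language model autonomously scale the number of compute steps used for each input token, so that compute is spent where it is needed.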

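⏸️ **"Pause" token mechanism:** The model can request additional compute time whenever it needs it. If the request is granted, a special "pause" token is inserted at the next input step, giving the model extra compute before it has to produce an output. The model can request multiple pauses.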

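⏳ **Sequential decisions with a time cost:** To train the model to request pauses judiciously and to calibrate its uncertainty, the selection of each output token is framed as a sequential-decision problem with a time cost. Three loss variants are studied: CYB-AP (anytime prediction), CYB-VA (a variational approach), and CYB-DP (a penalty based on a compute budget).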

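📊 **Large gain in training efficiency:** To reach the same performance, the CYB model needs only one third as much training data as the baseline (no pause) model, and half as much as a model with pauses and a cross-entropy loss.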

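🧠 **Adaptive compute allocation:** The CYB model requests additional compute steps according to token-level complexity and context. For example, it often pauses after plural nouns (such as "patients" and "challenges"), never pauses after the first token of contracted words (such as "wasn" and "didn"), and shows high variability for ambiguous tokens such as "won", which could be a verb or part of a contraction.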
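The request-and-grant loop described in the points above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the paper's implementation: `step_with_pauses`, `toy_model`, the `max_pauses` cap, and the literal token strings are all hypothetical names chosen here, and the "model" is a hand-written rule that asks for one pause after words ending in "s".

```python
from typing import Callable, List, Tuple

PAUSE = "<pause>"            # assumed spelling of the input-side pause token
DONT_KNOW = "<don't know>"   # assumed spelling of the output-side request


def step_with_pauses(
    model: Callable[[List[str]], str],
    context: List[str],
    max_pauses: int = 3,
) -> Tuple[str, int]:
    """Query `model` until it commits to a token or the pause cap is reached.

    Returns the final output and the number of pauses that were granted.
    Once the cap is reached the last output is returned as-is; how a real
    system forces a commitment is not specified in the summary.
    """
    inputs = list(context)
    pauses = 0
    while True:
        output = model(inputs)
        if output != DONT_KNOW or pauses == max_pauses:
            return output, pauses
        inputs.append(PAUSE)  # grant the delay: insert a pause at the next input step
        pauses += 1


def toy_model(inputs: List[str]) -> str:
    """Hypothetical stand-in: requests one pause after words ending in 's'."""
    n_pauses = 0
    while n_pauses < len(inputs) and inputs[-1 - n_pauses] == PAUSE:
        n_pauses += 1
    last_word = inputs[-1 - n_pauses] if n_pauses < len(inputs) else ""
    if last_word.endswith("s") and n_pauses == 0:
        return DONT_KNOW
    return "next"


if __name__ == "__main__":
    print(step_with_pauses(toy_model, ["the", "patients"]))
    print(step_with_pauses(toy_model, ["it", "wasn"]))
```

In this sketch the pause decision comes from a fixed rule; in the approach summarized above it is learned through the CYB training loss.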

arXiv:2510.13879v1 Announce Type: cross

Abstract: We explore a class of supervised training objectives that allow a language model to dynamically and autonomously scale the number of compute steps used for each input token. For any token, the model can request additional compute steps by emitting a `<don't know>` output. If the model is granted a delay, a specialized `<pause>` token is inserted at the next input step, providing the model with additional compute resources to generate an output. The model can request multiple pauses. To train the model to use `<don't know>` outputs judiciously and to calibrate its uncertainty, we frame the selection of each output token as a sequential-decision problem with a time cost. We refer to the class of methods as *Catch Your Breath* losses and we study three methods in this class: CYB-AP frames the model's task as anytime prediction, where an output may be required at any step and accuracy is discounted over time; CYB-VA is a variational approach that aims to maximize prediction accuracy subject to a specified distribution over stopping times; and CYB-DP imposes a penalty based on a computational budget. Through fine-tuning experiments, we identify the best performing loss variant. The CYB model needs only one third as much training data as the baseline (no pause) model needs to achieve the same performance, and half as much data as a model with pauses and a cross-entropy loss. We find that the CYB model requests additional steps when doing so improves accuracy, and the model adapts its processing time to token-level complexity and context. For example, it often pauses after plural nouns like *patients* and *challenges* but never pauses after the first token of contracted words like *wasn* and *didn*, and it shows high variability for ambiguous tokens like *won*, which could function as either a verb or part of a contraction.
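The abstract describes CYB-AP only in words ("an output may be required at any step and accuracy is discounted over time") and gives no formula. The sketch below shows one plausible reading, offered purely as an illustration: a negative log-likelihood averaged over compute steps with normalized geometric weights. The function name `discounted_anytime_loss`, the discount `gamma`, and the weighting scheme are assumptions made here, not the paper's definition.

```python
import math
from typing import List


def discounted_anytime_loss(p_correct: List[float], gamma: float = 0.9) -> float:
    """Discount-weighted negative log-likelihood over compute steps.

    p_correct[t] is the probability assigned to the correct token after
    t pauses. Step t gets weight gamma**t, normalized to sum to 1; this
    geometric form is an assumed stand-in for "accuracy discounted over
    time", not a formula taken from the paper.
    """
    if not p_correct:
        raise ValueError("need at least one compute step")
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    if any(not 0.0 < p <= 1.0 for p in p_correct):
        raise ValueError("probabilities must be in (0, 1]")
    weights = [gamma ** t for t in range(len(p_correct))]
    total = sum(weights)
    return -sum(w / total * math.log(p) for w, p in zip(weights, p_correct))


if __name__ == "__main__":
    # A model that is unsure at step 0 but improves after one pause.
    improving = [0.25, 0.5]
    print(discounted_anytime_loss(improving, gamma=1.0))
    print(discounted_anytime_loss(improving, gamma=0.5))
```

Under this assumed weighting, a smaller `gamma` puts more weight on the early steps, so a model that only becomes accurate after pausing is penalized more. That is one way to encode a time cost for requesting extra compute.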
