An investigation of a distance-measurement mechanism inside GPT-2

 

This article examines a distinctive distance-measurement mechanism that GPT-2 appears to use when handling duplicate tokens. The mechanism introduces "sentry points" and uses ReLU activations to approximately compute the distance between tokens under certain conditions. The article lays out the mathematics behind the mechanism, including how dot products and function derivatives are used to estimate distances, and draws an analogy to twisted-pair cabling in communications. The author also suggests potential extensions of the mechanism, such as interference cancellation for distinguishing similar dialects. Finally, the article compares the idealized mechanism with GPT-2's actual implementation, noting that GPT-2 optimizes by activating multiple sentries at once and by exploiting the helical structure of its positional embeddings.

💡 **Core mechanism: distance estimation via "sentry points."** The article proposes a distance-measurement method that places a series of "sentry points" in the function's domain to approximate the distance between input points. When the inputs `xi` and `xj` (corresponding to indices `i` and `j`) are close together, the distance `|i-j|` can be estimated by taking the dot product of `(xi - xj)` with the function derivative `f'(sm)` at a particular sentry point `sm`, combined with the ReLU activations of an MLP layer. The approach resembles twisted-pair cabling in communications, where differential signalling cancels out noise.

📐 **Mathematical principles and approximation:** The mechanism rests on a Taylor-style approximation that turns distance computation into operations on function derivatives and dot products of input vectors. A carefully designed MLP layer assigns a pair of neurons to each sentry point; thanks to the properties of ReLU, only the neuron pair belonging to the sentry nearest the input activates, producing a signal that encodes the distance. The article derives error bounds showing that the method gives a bounded distance estimate under suitable conditions.

🚀 **Extensions and noise robustness:** The author argues that this kind of "interference cancellation" is not limited to distance computation and could be applied wherever similar signals need to be told apart, for example distinguishing between English dialects. A key advantage is that even if the `f(sm)·xi` term is noisy, the common noise cancels between the two opposing neuron activations, so it has little effect on the final result.

🐍 **GPT-2's implementation in practice:** Comparing the idealized mechanism with GPT-2's actual behaviour, the article finds that GPT-2 innovates in its implementation: rather than activating a single sentry point, it may activate several at once, and it exploits the helical shape of its positional embedding matrix. The helix property lets the derivative `f'` be approximated by a linear transformation, which allows sentries to be reused and avoids having the number of sentries grow linearly with the input length `n`.

Published on October 24, 2025 2:03 AM GMT

Overview:

There is an interesting mechanism GPT-2 seems to use to measure distances between duplicate tokens. The mechanism reminds me a lot of twisted-pair cabling in communications.

The mechanism is fiddly to explain in context, so I've tried to abstract out most of the details and give a clean toy version of it. I think some of the structures GPT-2 develops for this mechanism could be used in contexts other than computing distances.

Setup:

We have the set of points $\{x_i\}_{i=0}^{n} \subset \mathbb{R}^d \subset \mathbb{R}^{d_{\text{model}}}$, with $x_i = f(i/n)$ where $f:[0,1]\to\mathbb{R}^d$ is a smooth function. We take as input $x_i, x_j$ with $|i-j| \le k \ll n$. We want to construct a transformer which can estimate $|i-j|$ given $x_i, x_j$. For this transformer, we additionally assume that $d$ is relatively small compared to the embedding dimension $d_{\text{model}}$.
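
To make the setup concrete, here is a minimal numpy sketch (the particular curve, dimensions, and variable names are my own choices for illustration, not taken from GPT-2): points $x_i = f(i/n)$ on a smooth, non-self-intersecting curve, embedded in the first $d$ coordinates of a $d_{\text{model}}$-dimensional space.

```python
import numpy as np

n, d, d_model, k = 199, 2, 16, 10            # k << n: the largest offset we care about

def f(t):
    """Toy smooth curve: half a turn of a circle, so it never comes close to itself."""
    return np.stack([np.cos(np.pi * t), np.sin(np.pi * t)], axis=-1)

def f_prime(t):
    """Derivative of f, used by the sentry construction later on."""
    return np.pi * np.stack([-np.sin(np.pi * t), np.cos(np.pi * t)], axis=-1)

ts = np.arange(n + 1) / n                    # the points i/n for i = 0, ..., n
X = np.zeros((n + 1, d_model))
X[:, :d] = f(ts)                             # x_i sits in the first d coordinates of R^{d_model}
```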

The mechanism:

We set up $M+1 < n$ "sentry points" $\{s_i\}_{i=0}^{M}$ uniformly along $[0,1]$, and define $g:[0,1]\to[0,M]\cap\mathbb{Z}$ sending $t\in[0,1]$ to the index of the closest sentry point.
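
Continuing the sketch above, the sentry points and the closest-sentry map $g$ might look like this ($M$ and the even spacing are again my own toy choices):

```python
M = 5
sentries = np.linspace(0.0, 1.0, M + 1)      # M+1 sentry points, uniformly spaced on [0,1]

def g(t):
    """Index of the sentry point closest to t."""
    return int(np.argmin(np.abs(sentries - t)))
```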

Then we have
$$\left\|\, x_i - x_j - \left(\tfrac{i}{n} - \tfrac{j}{n}\right) f'\!\big(s_{g(i/n)}\big) \right\|_2 = \left\| \int_{j/n}^{i/n} f'(t) - f'\!\big(s_{g(i/n)}\big)\, dt \right\|_2 \le L\left(\tfrac{1}{2M} + \tfrac{|i-j|}{n}\right)\tfrac{|i-j|}{n},$$

where $L$ is such that $\|f'(x) - f'(y)\|_2 \le L\,|x-y|$ for all $x, y \in [0,1]$.

So
$$\left|\, \frac{(x_i - x_j)\cdot f'\!\big(s_{g(i/n)}\big)}{\big\|f'\!\big(s_{g(i/n)}\big)\big\|_2^2} - \frac{i-j}{n} \right| \le \frac{\left\| \int_{j/n}^{i/n} f'(t) - f'\!\big(s_{g(i/n)}\big)\, dt \right\|_2}{\big\|f'\!\big(s_{g(i/n)}\big)\big\|_2} \le \frac{L}{\big\|f'\!\big(s_{g(i/n)}\big)\big\|_2}\left(\frac{1}{2M+2} + \frac{|i-j|}{n}\right)\frac{|i-j|}{n}.$$

Therefore if we can approximate $\frac{(x_j - x_i)\cdot f'(s_{g(i/n)})}{\|f'(s_{g(i/n)})\|_2^2}$, then (since $n$ is known) we can approximate $j - i$.
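
A quick numerical check of this estimate, continuing the sketch above (indices chosen arbitrarily with $|i-j|\le k$): projecting $x_i - x_j$ onto $f'(s_{g(i/n)})$ and rescaling recovers $i - j$ to good accuracy.

```python
i, j = 120, 115                              # |i - j| <= k
fp = f_prime(sentries[g(i / n)])             # derivative at the sentry closest to i/n
est = (X[i, :d] - X[j, :d]) @ fp / (fp @ fp) # ~ (i - j)/n by the bound above
print(n * est, i - j)                        # prints roughly 5.0 and 5
```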

Attention mechanism:

If we are given a two-token input $x_j,\, x_i$ to the transformer, then assuming that $d \le d_{\text{head}}$, two attention heads are sufficient to compute $x_j - x_i$ (have one head which outputs $-x_i$, and the other $x_j$). We write $x_j - x_i$ to a subspace orthogonal to $x_i$ so that the MLP can cleanly access $x_i$ later.
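
Here is a sketch of what this attention step produces, under the assumption (mine) that the two tokens arrive as $x_j$ followed by $x_i$ and that each head attends hard to one position; the two heads' contributions are collapsed into a single write of $x_j - x_i$ into coordinates $[d, 2d)$, leaving $x_i$ untouched in $[0, d)$.

```python
def attention_step(h):
    """h: (2, d_model) residual stream; row 0 holds x_j, row 1 holds x_i."""
    out = h.copy()
    x_j, x_i = h[0, :d], h[1, :d]
    out[1, d:2 * d] = x_j - x_i              # one head contributes +x_j, the other -x_i
    return out                               # last token: x_i in [0,d), x_j - x_i in [d,2d)

h0 = attention_step(np.stack([X[j], X[i]]))  # residual stream post attention
```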

MLP mechanism:

The MLP mechanism consists of $2M+2$ neurons, with a pair of neurons associated with each sentry point.

For each sentry point $s_m \in \{0, \tfrac{1}{M+1}, \tfrac{2}{M+1}, \ldots, 1\}$, we define a neuron pair:

$$\text{Neuron}^{+}_m = \mathrm{ReLU}\big(b_{m,1}\, f(s_m)\cdot x_i + b_{m,2} + b_{m,3}\,(x_j - x_i)\cdot f'(s_m)\big)$$

$$\text{Neuron}^{-}_m = \mathrm{ReLU}\big(b_{m,1}\, f(s_m)\cdot x_i + b_{m,2} - b_{m,3}\,(x_j - x_i)\cdot f'(s_m)\big)$$

where the constants $b_{m,l}$ are tuned so that $b_{m,1}\, f(s_m)\cdot x_i + b_{m,2} > \epsilon$ when $m = g(i/n)$, and $b_{m,1}\, f(s_m)\cdot x_i + b_{m,2} < 0$ otherwise. Additionally we scale $b_{m,3}$ so that $b_{m,3}\,(x_j - x_i)\cdot f'(s_m)$ has a magnitude of less than $\epsilon$ when $|i-j| \le k$.

[Figure: example sentry neuron activations when $x_i = x_j$, with $M = 5$; each colour corresponds to the activation of a different sentry neuron.] We can pick $M+1$ coprime to $n$ so that the sentry activations don't vanish at $i/n$ for any $i$, and so they are bounded below by some $\epsilon > 0$. A signal of magnitude $\epsilon$ can then be encoded in the difference between the activations of pairs of these sentry neurons.

Setting up these sentries relies on $f$ not coming close to intersecting itself, so that the dot product with $f(s_m)$ is only high on a connected interval.

We wrote $x_j - x_i$ in a subspace orthogonal to $x_i$ so the sentry neurons can all be written in the standard form $\mathrm{ReLU}(w^{T} h_0 + b)$, where $h_0$ is the residual stream post-attention.

We then output $\frac{C_m}{2\, b_{m,3}}\big(\text{Neuron}^{+}_m - \text{Neuron}^{-}_m\big)\, v$ from the $m$th neuron pair.

Under this construction $\text{Neuron}^{+}_m$ and $\text{Neuron}^{-}_m$ always activate at the same time as each other, so the output of the $m$th neuron pair is $0$ if $m \ne g(i/n)$, and $C_m\,(x_j - x_i)\cdot f'(s_m)\, v$ if $m = g(i/n)$.

Since only a single neuron pair activates at once, the complete output of this MLP layer is $C_{g(i/n)}\,(x_j - x_i)\cdot f'\big(s_{g(i/n)}\big)\, v$.

Then setting $C_{g(i/n)}$ proportional to $\frac{1}{\|f'(s_{g(i/n)})\|_2^2}$, we get an output proportional to $(j - i)\, v$.
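
Putting the whole MLP construction together as a toy implementation, continuing the sketch above (the numerical calibration of the $b_{m,l}$ and the specific constants are my own; the construction only requires that suitable constants exist). For this curve the gate term $f(s_m)\cdot x_i$ is largest precisely on the interval where $g(i/n) = m$, so a per-sentry affine threshold separates "on" from "off":

```python
eps = 0.01
A = f(ts) @ f(sentries).T                    # A[i, m] = f(s_m) . x_i over all grid points
closest = np.array([g(t) for t in ts])       # g(i/n) for every i

b1, b2 = np.zeros(M + 1), np.zeros(M + 1)
for m in range(M + 1):
    hi = A[closest == m, m].min()            # smallest gate value that must stay "on"
    lo = A[closest != m, m].max()            # largest gate value that must stay "off"
    b1[m] = 4 * eps / (hi - lo)              # affine map sending lo -> -2*eps, hi -> +2*eps
    b2[m] = -b1[m] * (hi + lo) / 2

b3 = eps * n / (np.pi ** 2 * k)              # keeps |b3 (x_j - x_i).f'(s_m)| <= eps when |i-j| <= k
C = 1.0 / np.pi ** 2                         # proportional to 1/||f'(s_m)||^2 (constant for this curve)
v = np.zeros(d_model); v[-1] = 1.0           # output direction

def mlp_distance(h_last):
    """Sum the M+1 sentry-pair outputs; reads x_i from [0,d) and x_j - x_i from [d,2d)."""
    x_i, diff = h_last[:d], h_last[d:2 * d]
    out = np.zeros(d_model)
    for m in range(M + 1):
        gate = b1[m] * (f(sentries[m]) @ x_i) + b2[m]
        signal = b3 * (diff @ f_prime(sentries[m]))
        pos = max(gate + signal, 0.0)        # Neuron+_m
        neg = max(gate - signal, 0.0)        # Neuron-_m
        out += (C / (2 * b3)) * (pos - neg) * v
    return out

print(n * (mlp_distance(h0[1]) @ v), j - i)  # roughly -5.0 and -5
```

Only the pair for the sentry nearest $i/n$ contributes, and its contribution is $C\,(x_j - x_i)\cdot f'(s_m)\, v \approx \frac{j-i}{n}\, v$, matching the claim above.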

Extensions of mechanism:

The above mechanism is clean, and captures the key ideas. A nice thing about the mechanism is that the $f(s_m)\cdot x_i$ term can be noisy, but it doesn't matter, because the common noise gets cancelled out, similar to twisted-pair encoding.

However, there can still be issues caused by noise at the transitions between sentries. The mechanism is also not robust to $f$ intersecting itself, and the number of sentries required grows with $n$.

There are cases where this kind of two neuron interference cancellation could come in useful outside of computing distance. For example, if you want to distinguish between British and Canadian English, you could have:

$$\text{Neuron}_{\text{canada}} = \mathrm{ReLU}\big(\text{Commonwealth} + \epsilon\,(\text{Canadian})\big)$$

$$\text{Neuron}_{\text{british}} = \mathrm{ReLU}\big(\text{Commonwealth} + \epsilon\,(\text{British})\big)$$

And then take the difference between the two. The interference that would usually make it difficult to distinguish between the two very similar dialects gets cancelled out.

Though this is probably just PCA??
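
A toy numerical illustration of this two-neuron cancellation (the feature directions below are made up and chosen orthogonal purely so the arithmetic is easy to read; real learned features would be messier): the large, noisy shared "Commonwealth" activation cancels exactly in the difference, leaving only the small dialect-specific signal.

```python
import numpy as np

commonwealth = np.array([1.0, 0.0, 0.0])     # large shared feature direction
canadian = np.array([0.0, 1.0, 0.0])         # small distinguishing directions
british = np.array([0.0, 0.0, 1.0])
eps_d = 0.1

def dialect_score(x):
    neuron_canada = max((commonwealth + eps_d * canadian) @ x, 0.0)   # Neuron_canada
    neuron_british = max((commonwealth + eps_d * british) @ x, 0.0)   # Neuron_british
    return neuron_canada - neuron_british    # the shared Commonwealth term cancels

noise = np.random.default_rng(0).normal()
x = (5.0 + noise) * commonwealth + canadian                           # noisy "Canadian" input
print(dialect_score(x))                      # = eps_d, independent of the noise on the shared term
```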

Preliminary GPT-2 specific mechanism:

GPT-2 seems to adopt a mechanism similar to the one discussed above when calculating distances between the positions of duplicate tokens, but with a couple of differences. Firstly, GPT-2 has multiple sentries active at once, which the mechanism above simplifies away. Secondly, instead of using sentries based on the value of $f(i/n)$, it learns an $f$ for which it can cheaply compute an approximation to $f'$.

*GPT-2's positional embedding matrix is a helix* shows that the positional embeddings lie on a helix. A property of helices is that $f' = Wf + c$, which means that we can compute $f'(i/n) = y_i = Wx_i + c$ just by using an attention head to apply a linear transformation. Of course GPT-2's positional embeddings don't precisely lie on a helix, and all of this holds only up to approximation.
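
Here is a quick check of the property $f' = Wf + c$ on a toy helix (the specific helix, $W$, and $c$ are my own; GPT-2's learned embeddings only satisfy this approximately):

```python
import numpy as np

alpha, omega = 0.5, 2 * np.pi                # pitch and angular frequency of a toy helix

def helix(t):
    return np.array([alpha * t, np.cos(omega * t), np.sin(omega * t)])

def helix_prime(t):
    return np.array([alpha, -omega * np.sin(omega * t), omega * np.cos(omega * t)])

# For this helix, f'(t) = W f(t) + c with a fixed rotation-like W and constant c,
# so an attention head applying a linear map can produce y_i = f'(i/n) from x_i = f(i/n).
W = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -omega],
              [0.0, omega, 0.0]])
c = np.array([alpha, 0.0, 0.0])

t = 0.37
print(np.allclose(W @ helix(t) + c, helix_prime(t)))   # True
```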

We can then set up sentries that use $f'(s_m)\cdot y_i$ instead:

$$\text{Neuron}^{+}_m = \mathrm{ReLU}\big(b_{m,1}\, f'(s_m)\cdot y_i + b_{m,2} + b_{m,3}\,(x_j - x_i)\cdot f'(s_m)\big)$$

$$\text{Neuron}^{-}_m = \mathrm{ReLU}\big(b_{m,1}\, f'(s_m)\cdot y_i + b_{m,2} - b_{m,3}\,(x_j - x_i)\cdot f'(s_m)\big)$$

Because the positional embeddings of GPT-2 lie on a helix, $f'$ is periodic ($f(t) = (t, \cos(t), \sin(t))$ isn't periodic, but $f'(t) = (1, -\sin(t), \cos(t))$ is). This allows sentries to be reused across distant positions, so the number of sentries doesn't necessarily need to scale with $n$.
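
And a check of the periodicity claim, continuing the toy helix sketch above: $f'$ repeats every turn even though $f$ itself does not, which is what lets a fixed set of sentry directions $f'(s_m)$ be reused for positions a whole number of turns apart.

```python
period = 2 * np.pi / omega                    # one full turn of the toy helix
print(np.allclose(helix_prime(0.12 + period), helix_prime(0.12)))   # True: f' is periodic
print(np.allclose(helix(0.12 + period), helix(0.12)))               # False: f itself is not
```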


