LessWrong · September 4
Understanding Systematic Uncertainty and Bias in AI Evaluations

Written by a former CERN particle physicist, this article examines the systematic uncertainties and biases that are often overlooked in AI research papers. The author argues that the temperature parameter affects not only the variability of AI behaviour but can change the behaviour patterns themselves. Systematic uncertainty stems from hidden biases in experimental design rather than from statistical error. While there is no one-size-fits-all solution, rigorously testing experimental assumptions is essential. Using temperature and poker betting strategies as examples, the article shows how such biases can shape research results, and stresses the importance of varying parameters, recording experimental settings, and reporting conditional conclusions to ensure robust, reproducible research.

🌡️ **Temperature changes AI behaviour patterns, not just their variability**: Changing a model's temperature parameter affects more than the randomness (variance) of its behaviour; it can fundamentally alter the behaviour patterns themselves. In the poker betting-strategy study, different temperature settings produced biases of different directions in the model's reactions to specific opponent names (such as the probability of raising), suggesting that temperature is a systematic factor shaping AI behaviour rather than a simple "noise" level.

⚖️ **Systematic uncertainty stems from hidden biases in experimental design**: Unlike statistical uncertainty, the evaluation of systematic uncertainty is treated as "an art" in physics because it has no rigorous mathematical foundation; it relies instead on a deep understanding of the experimental design and the biases it may introduce. These biases can come from any experimental condition or parameter that was not adequately considered: the exact wording of a prompt, subtle changes in the experimental environment, or even reliance on a tool such as a ruler can produce results that deviate from expectations.

🚫 **No universal solution; rigorous testing and transparent records are needed**: There is no one-size-fits-all remedy for systematic uncertainty. The practices the author recommends include: 1. varying prompts (including system prompts) and parameters such as temperature; 2. recording every "arbitrary" choice of experimental settings; 3. explicitly reporting the conditions under which conclusions hold; 4. thinking hard about the factors and assumptions that could affect the conclusions, and verifying them. This increases the cost and time of experiments, but wrong conclusions cost more.

📈 **AI evaluation must go beyond single parameters and zero-temperature settings**: Traditional AI evaluation often runs a one-off test under a fixed condition such as temperature 0, which can miss important systematic biases. To obtain robust and reproducible results, the author argues for running experiments many times, varying parameters such as temperature and prompt wording, and collecting enough statistics to fully understand the AI's behaviour and its underlying systematic uncertainties.

Published on September 3, 2025 11:10 PM GMT

Andrey Seryakov 
ex-CERN particle physicist
independent AI behaviour researcher
a.u.seryakov@gmail.com

This article is about the systematic uncertainties and biases that are so often missing from our research papers.

TL;DR

This short post should be considered an extension of the paper "Adding Error Bars to Evals" by Evan Miller, which extensively covers the problem and the need for statistical uncertainty in AI evaluations on a very rigorous mathematical foundation. However, while Miller addresses how to measure statistical "error bars", I focus on the orthogonal problem: biases that change your entire measurement.

In contrast to statistical uncertainties, the evaluation of systematic errors in physics is what we call "an art", as there is no mathematical foundation at all. It is always about knowing your experiment's design well, and the biases it introduces into your final results. Systematics are about biases.

But first, I want to talk a bit about the temperature case, as I believe it is very illustrative, and only afterwards move to systematics in general.

Can and should we take temperature into account? 

You wouldn't study how fair a coin is by flipping it just once, yet that's what we do with LLMs at t = 0. LLMs are probabilistic by construction. If we want to study their behaviour, we have to gather statistics, but which temperature should we use? The widely used t = 0 will give you nothing. Before actually experimenting with it, I was thinking of taking two values (for example 0.7 and 1.0) and plotting two different error bars, as we do in physics with statistical and systematic uncertainties: a ± stat ± sys.
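To make the "gather statistics" point concrete, here is a minimal sketch of estimating an action probability with a binomial error bar at two temperatures. The `query_model` argument is a hypothetical stand-in for a real model call (here simulated with a toy random function); the probabilities are invented for illustration.

```python
import math
import random

def binomial_error(p: float, n: int) -> float:
    """Statistical (binomial) standard error on an estimated probability."""
    return math.sqrt(p * (1.0 - p) / n)

def estimate_raise_prob(query_model, temperature: float, n_trials: int):
    """Estimate P(raise) at a given temperature with its statistical error bar.

    `query_model(temperature)` is a hypothetical stand-in for one model call
    returning the chosen action ("fold", "call", or "raise").
    """
    raises = sum(query_model(temperature) == "raise" for _ in range(n_trials))
    p = raises / n_trials
    return p, binomial_error(p, n_trials)

# Toy simulated model (illustration only): raise probability drifts with temperature.
rng = random.Random(0)
def fake_model(t):
    return "raise" if rng.random() < 0.30 + 0.10 * t else "call"

for t in (0.7, 1.0):
    p, err = estimate_raise_prob(fake_model, t, n_trials=400)
    print(f"t = {t}: P(raise) = {p:.3f} +/- {err:.3f}")
```

With 400 trials per condition, the binomial error bar is at most about 0.025, which sets the scale of name effects the experiment can resolve.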

We often treat temperature as just a noise level, and I did too, but then I found that when I change it, it is not only the variance of behaviour patterns that changes significantly; the patterns themselves change.

I don't know how to explain this properly, so let me illustrate with an example.

Last week I was studying whether models' betting strategies in poker depend on the opponent's name. I set up a specific situation and asked the model to choose an action: Fold, Call, or Raise Xbb, where Xbb is the number of big blinds or, in other words, how many times its bet is higher than the opponent's.

If you are familiar with poker, here is more information about the setup. If you are not, just skip it; it is not really related to the uncertainty discussion.

I used Llama 3.1 8B (I have no money for research, so I needed something cheap and fast).

For every name I ran it 400 times, and 1000 times for the baseline where no name is provided.

You are now playing heads-up against {name} Smith 
vs 
You are now playing heads-up.

These are the results for t = 0.7 and t = 1 for the 5 most popular male and 5 most popular female kids' names in the US (2024):

The first column shows results for temperature 0.7, the second for temperature 1. The first row is the probability of raising against a player with a given name; the baselines represent the case where no name was provided (see above). The second row of plots shows the statistical significance of the difference between the case with a given name and the one without it.
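The significance in the second row of plots can be computed with a standard two-proportion z-test between the name condition and the no-name baseline. A minimal sketch, with made-up counts rather than the actual measured values:

```python
import math

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> float:
    """z-score for the difference between two observed proportions
    (name condition vs. no-name baseline), using a pooled estimate."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se

# Hypothetical counts: 180/400 raises with a name vs. a 400/1000 baseline.
z = two_proportion_z(180, 400, 400, 1000)
print(f"z = {z:.2f}")  # |z| above ~2 would suggest a real name effect
```

The same test applied per name and per temperature is what makes the temperature dependence of the biases visible in the first place.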

That the mean changes is fine; this study is about biases towards specific names, not about how models play. You would expect more "extreme" betting strategies as you increase the temperature. What worries me is that the result itself changes: the biases now point at different names, and even in different directions.

The main conclusion stands: the opponent's name (and, I guess, the user's name in general) systematically changes how some models (at least this one and Llama Scout) behave, but the specific set of names in which biases are observed changes with temperature. And this makes everything much more complicated.

So it is clear that temperature has to be taken into account, but how do we do it properly? Can we create a universal algorithm? That I don't know. Let's now move to the general discussion.

What is a systematic uncertainty and how do we evaluate it?

As I said before, systematics are "an art": they are about thinking deeply about your experimental setup. You have to look for parameters and conditions that you use but believe should not affect your results. You then have to check that they indeed don't, and if they do, you have two choices. The first is to vary them, see how the results change, and add that change as a separate uncertainty to your points. Don't ask me exactly how, mathematically; it is an art. You may take the max and min of the variations. You may have many such parameters; assuming they influence the results independently, you can combine their variations as the root of the sum of squares (add them in quadrature) and use the result as an additional systematic uncertainty. The second possibility is to restrict your conclusions, writing explicitly that they are valid under the following conditions, and that if the conditions change, the results change too.
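As a sketch of that quadrature combination (the shift values below are invented for illustration; they would come from varying temperature, prompt wording, option order, and so on):

```python
import math

def combine_systematics(deltas) -> float:
    """Combine independent systematic variations in quadrature:
    sigma_sys = sqrt(sum(delta_i ** 2))."""
    return math.sqrt(sum(d * d for d in deltas))

# Hypothetical shifts in P(raise) observed when varying temperature,
# prompt wording, and answer-option order (made-up numbers):
sigma_sys = combine_systematics([0.04, 0.02, 0.01])
print(f"result: 0.35 +/- 0.02 (stat) +/- {sigma_sys:.3f} (sys)")
```

Quadrature is only justified if the variations are independent; correlated sources would need to be varied together instead.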

A simpler example of a systematic. Imagine you need to measure the volume of a given cube. How would you do it? With a ruler. You get the statistical uncertainty by measuring the same side several times. What about the systematic? There is one obvious source: the ruler itself! You do not expect the result to change if you swap rulers, but are you 100% sure? Many years ago, when I was teaching experimental physics, I went to a bookstore and bought two rulers whose centimetres were of different lengths. Which of them is the right one, and are you using the right one? Even such a simple tool may introduce a bias into your measurement.

[Image: two rulers. Picture from the internet, author unknown.]

This logic applies to the poker example above. The conclusion that Llama 3.1 8B plays more aggressively against Charlotte is valid only at t = 0.7, not at other temperatures. But other sources of systematic bias may exist there too. They would not change the main conclusion (the model is biased by the opponent's name), but they could change any numerical estimates. Examples:

Imagine you are studying how LLMs play the hawks and doves game. Will your results change if you rename the hawks and doves to other animals, or to people's names, or call them strategy 1 and strategy 2? If you have player 1 and player 2, what happens if you swap their names?

Maybe you have an experiment where LLMs have to agree on some actions: what biases did you introduce in your prompts? Will their performance change if you provide the instructions in a different order? Or if the collective discussion changes from the models speaking in a fixed order to a random one?

Any practical advice? There is none; each experiment is unique, you designed it, and you have thought about it more than anybody else. Based on my AI experience, I would always:

Think about what may affect your conclusions and which assumptions you made, and check that they are valid.

Gather statistics; don't run your experiment just once at temperature 0.

Think about reformulating your prompt: change everything except its very core, and change it several times.

So this is an art, and there is no universal solution, no magic pill. Such studies are crucial for making research robust and reproducible. Yes, this means running experiments takes 10x longer, but wrong conclusions are even more expensive.
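The advice above can be sketched as a small robustness sweep that varies the "arbitrary" choices and logs every configuration. The prompt paraphrases, numbers, and the `run_trial` stub below are all invented for illustration; in a real study, `run_trial` would call the model.

```python
import itertools
import json
import random

# Hypothetical "arbitrary" choices to vary: prompt wording and temperature.
PROMPTS = [
    "You are now playing heads-up against {name} Smith.",
    "Your heads-up opponent is {name} Smith.",  # paraphrase with the same core
]
TEMPERATURES = [0.7, 1.0]

def run_trial(prompt: str, temperature: float, rng: random.Random) -> str:
    """Stand-in for one real model call; returns a poker action."""
    return "raise" if rng.random() < 0.30 + 0.10 * temperature else "call"

rng = random.Random(42)
results = []
for prompt, t in itertools.product(PROMPTS, TEMPERATURES):
    n = 400
    raises = sum(run_trial(prompt, t, rng) == "raise" for _ in range(n))
    # Record every setting so conclusions can be reported conditionally
    # and the run reproduced later.
    results.append({"prompt": prompt, "temperature": t,
                    "n": n, "p_raise": raises / n})

print(json.dumps(results, indent=2))
```

If `p_raise` is stable across the sweep, the conclusion is robust to those choices; if not, the spread itself becomes a systematic uncertainty or a stated condition on the result.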

This is the way.


