Exploring the Challenges of Evaluating and Researching Large Language Model (LLM) Capabilities

 

This article takes a close look at the complexities involved in evaluating large language model (LLM) capabilities. It raises key questions on several fronts: whether and when LLMs can master specific abilities, how fast and why those abilities improve, how to build effective evaluations, and how to distinguish genuine progress from surface appearances. It also examines efforts to understand LLMs' internal workings and how those internal structures shape external behavior. On methodology, it discusses how LLMs themselves might be used for alignment research. It further considers researchers' potential biases in the face of rapid capability progress, and how organizational groupthink can distort research directions. The aim is to offer a clearer perspective on researching and evaluating LLM capabilities, and a reference point for AI safety research.

💡 **The challenge of capability evaluation**: The article stresses the importance of accurately assessing LLM capabilities and poses a series of questions worth careful thought, such as "Can LLMs do X?", "How fast is LLM performance improving?", and "How do we design reliable evaluations?". Quantifying and tracking LLM capabilities is key to understanding their trajectory, but the complexity involved demands carefully designed metrics that avoid misleading observations.

⚙️ **Probing LLM internals**: Beyond observing external behavior, the article examines the need to understand LLMs' internal workings: the "gears of their cognition", how specific interventions (such as "trick A") act on those internal structures, and what predictions about external behavior follow. Exploring model internals helps clarify where capabilities come from and where their limits might lie.

🚀 **Bias and organizational dynamics in research**: The article notes challenges such as groupthink and cognitive tunneling that large organizations doing AGI research may face. Research directions can be shaped by internal fashions and hot topics (such as "agency" or "continual learning") rather than by actual productivity. This attention to organizational dynamics and "memetic currents" is a reminder to watch for factors that keep researchers from noticing the questions that really matter.

⚖️ **Balancing openness against risk**: A core theme is the trade-off in publicly discussing LLM capabilities and potential risks. Open discussion pools collective insight and can accelerate AI safety research, but it may also inadvertently speed up capabilities research. The article explores how hard it is to find the optimal balance between openness and potential harm, and reflects on the responsibility that comes with spreading such information.

🧠 **Researchers' cognition and understanding AI**: The article makes an interesting point: objections to accelerating AI capability progress may be entangled with having a flawed model of AI, minds, and agency. If one's understanding of AI is correct, one should be able to produce better insights about capability progress. Researchers' own cognitive models and theoretical frameworks therefore matter greatly for understanding and steering AI development.

Published on September 29, 2025 10:21 PM GMT

On one hand, we want to make sure our threat model is accurate. Consider questions such as:

Examples

    Can LLMs do X?
    Will LLMs ever be able to do X?
    How fast will LLMs' performance at the X task improve?
    Why does LLMs' performance at the X task improve?
    What concrete observation would make you update towards LLMs ultimately being able to do X?
      Can we build an eval for that?
    What observations are intuitively plausible but actually bad proxies for tracking X? (See the toy eval sketch after this list.)
      How widespread is this problem in public evals?
    LLMs fail at X. What might be the underlying problem?
      Can we describe that problem in concrete technical terms? Or at least intuitive terms?
      Do we have reasons to believe it'd be easy/hard for the capability researchers to fix this problem? What are those reasons?
      What might they try to do to fix it, and why would they succeed/fail?
      Are there already any papers/known techniques which can clearly be adapted to solve this problem?
    What's going on inside LLMs? What are the gears of their cognition?
      How does trick A intervene on those internal structures?
      Does our model of LLMs' internals account for A improving LLMs' performance at X?
      What other predictions about LLMs' external behavior can we make from our model of what A does?
      If our model is confirmed, what other ways to intervene on the LLMs' internals does it suggest, and what may be achieved this way? (A toy sketch of this kind of intervention follows the list.)
    Are all the people hyping up LLMs' capabilities and their impact on their productivity credible, or are there reasons to doubt their accounts?
      What observables would be most credible?
      What experiments can we run to tell those hypotheses apart?

And so on.
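
To make the eval-building and bad-proxy questions above concrete, here is a minimal sketch of what checking a proxy against ground truth could look like. Everything in it is made up for illustration: `ask_model` is a hypothetical stand-in for whatever LLM API you would actually call, the "X" task is toy multiplication, and the canned answers exist only so the script runs on its own.

```python
import re

# Hypothetical stand-in for an LLM call; swap in a real API client here.
def ask_model(prompt: str) -> str:
    canned = {
        "What is 17 * 24?": "Let's work through it step by step... the answer is 408.",
        "What is 313 * 27?": "Step by step, carrying carefully... the answer is 8441.",  # actually 8451
    }
    return canned.get(prompt, "I'm not sure.")

# A tiny eval for "task X" (here: two-operand multiplication), with ground truth.
cases = [("What is 17 * 24?", "408"), ("What is 313 * 27?", "8451")]

passed, proxy_hits = 0, 0
for prompt, expected in cases:
    answer = ask_model(prompt)
    # Real measure: does the final number in the answer match the ground truth?
    numbers = re.findall(r"\d+", answer)
    passed += bool(numbers) and numbers[-1] == expected
    # Intuitively-plausible proxy: "the model visibly reasons step by step".
    proxy_hits += "step by step" in answer.lower()

print(f"true pass rate on X: {passed}/{len(cases)}")
print(f"proxy 'score':       {proxy_hits}/{len(cases)}")  # looks great even when the answer is wrong
```

The point is only the final comparison: a proxy ("the model reasons step by step") can score perfectly while the actual pass rate on X is mediocre, which is exactly the failure mode the "bad proxies" question is pointing at.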
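
And to make the internals questions concrete: below is a minimal sketch, using a toy PyTorch model rather than a real LLM, of what "trick A intervenes on internal structures, and our model of those internals predicts an external behavioral change" can look like mechanically. The steering-vector intervention and the layer choice are assumptions for illustration; with a real model you would hook an actual transformer block and measure a task-relevant behavior rather than raw output flips.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for an LLM: a small stack of layers with a classification head.
class ToyModel(nn.Module):
    def __init__(self, d: int = 16, n_layers: int = 4, n_out: int = 3):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(d, d) for _ in range(n_layers)])
        self.head = nn.Linear(d, n_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = torch.relu(layer(x))
        return self.head(x)

model = ToyModel()
inputs = torch.randn(8, 16)

# External behavior before the intervention.
baseline = model(inputs).argmax(dim=-1)

# "Trick A" (hypothetical): add a fixed direction to one intermediate layer's
# output, implemented as a forward hook that rewrites that layer's activations.
steering_vector = 3.0 * torch.randn(16)

def add_steering(module, hook_inputs, output):
    return output + steering_vector

handle = model.layers[2].register_forward_hook(add_steering)
steered = model(inputs).argmax(dim=-1)
handle.remove()

# Whatever story we tell about the internals should cash out as a prediction
# about observable behavior; here, how often the intervention changes the output.
flip_rate = (baseline != steered).float().mean().item()
print(f"outputs changed by the intervention: {flip_rate:.0%}")
```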

Obviously we want to know all of those things! They directly bear on the questions of whether LLMs may become a threat, and how soon, and whether the current alignment techniques work, and what better alignment techniques may be effective, and whether we should invest in empirical or theoretical research, and whether we can use LLMs themselves for alignment R&D, and whether we can create legible evidence of LLMs' misalignment/danger/rapidly growing capabilities in order to convince the policymakers/the public/the capability researchers.

We obviously don't want to be flying blind, utterly oblivious to what's going to happen, unable to spot opportunities for effective interventions, unable to come up with an effective intervention to begin with.

On the other hand, talking about any of this is obviously dual-use. Capability researchers also want to know whether LLMs would be able to do X, and what might be currently impeding them, and how they can fix that problem (detailed technical suggestions and literature reviews appreciated!), and what's going on inside them, and how their insides can be modified to achieve desired results, and what benchmarks or external indicators are good or bad ways to track which techniques are useful and how much real progress is happening.

E. g., consider a world in which LLMs are in fact a dead end, but the capability researchers are in denial about it. In that world, it would certainly shorten the timelines if they were snapped out of their delusions.


What's the correct way to balance those considerations?

The extremist pro-speculation position would go like this:

The AGI labs are billion-dollar megacorporations employing world-class researchers. They have vast troves of internal research significantly ahead of the public SOTA, which gives them a considerably better vantage point. Anything we can think of, they have already thought of, and either checked it empirically, or have a whole team working on it full-time. As to our theoretical speculations, (1) they're probably wrong anyway and (2) nobody will probably read them anyway, so there's no harm in posting them.

We should do our best to improve our understanding of our situation and our ability to engage in up-to-date alignment research, and public discussion is most conducive to that.

The counterarguments would go:

So I think worrying about posting capability-advancing exfohazards is a legitimate concern worth keeping in mind.

Which, again, raises the question: what's the optimal amount of talking about this? What are useful rules of thumb here? Ideas welcome.

There are edge cases where posting something is obviously net negative, e. g., publishing your detailed AGI blueprint in order to win some argument. But what's the most concrete thing on the topic which is okay to post? Like, as I'd mentioned, even "I don't think LLMs scale to AGI" is arguably stupid to say, because what if it's true and you argue it so convincingly that even capability researchers are persuaded?[2]

  1. ^

    And there's of course some reason to believe this nudging process would point in the direction of the truth (though it might also be biased by memetic currents).

  2. ^

    And yes, that concern was already on my mind when I made that post, but I decided it's probably okay in that case, and also barely anyone will read it anyway, right?

    I have, ahem, mixed feelings about it ending up with ~400 karma.


