On one hand, we want to make sure our threat model is accurate. Consider questions such as:
- Can LLMs do X?
  - Will LLMs ever be able to do X?
  - How fast will LLMs' performance at the X task improve?
  - Why does LLMs' performance at the X task improve?
  - What concrete observation would make you update towards LLMs ultimately being able to do X?
- Can we build an eval for that?
- How widespread is this problem in public evals?
- Can we describe that problem in concrete technical terms? Or at least intuitive terms?
  - Do we have reasons to believe it'd be easy/hard for the capability researchers to fix this problem? What are those reasons?
  - What might they try to do to fix it, and why would they succeed/fail?
  - Are there already any papers/known techniques which can clearly be adapted to solve this problem?
- How does trick A intervene on those internal structures?
  - Does our model of LLMs' internals account for A improving LLMs' performance at X?
  - What other predictions about LLMs' external behavior can we make from our model of what A does?
  - If our model is confirmed, what other ways to intervene on the LLMs' internals does it suggest, and what may be achieved this way?
- What observables would be most credible?
  - What experiments can we run to tell those hypotheses apart?
And so on.
Obviously, we want to know all of those things! They directly bear on the questions of whether LLMs may become a threat, and how soon, and whether the current alignment techniques work, and what better alignment techniques may be effective, and whether we should invest in empirical or theoretical research, and whether we can use LLMs themselves for alignment R&D, and whether we can create legible evidence of LLMs' misalignment/danger/rapidly growing capabilities in order to convince the policymakers/the public/the capability researchers.
We obviously don't want to be flying blind, utterly oblivious to what's going to happen, unable to spot opportunities for effective interventions, unable to come up with an effective intervention to begin with.
On the other hand, talking about any of this is obviously dual-use. Capability researchers also want to know whether LLMs would be able to do X, and what might be currently impeding them, and how they can fix that problem (detailed technical suggestions and literature reviews appreciated!), and what's going on inside them, and how their insides can be modified to achieve desired results, and what benchmarks or external indicators are good or bad ways to track which techniques are useful and how much real progress is happening.
E. g., consider a world in which LLMs are in fact a dead end, but the capability researchers are in denial about it. In that world, it would certainly shorten the timelines if they were snapped out of their delusions.
What's the correct way to balance those considerations?
The extremist pro-speculation position would go like this:
The AGI labs are billion-dollar megacorporations employing world-class researchers. They have vast troves of internal research significantly ahead of the public SOTA, which gives them a considerably better vantage point. Anything we can think of, they have already thought of, and either checked it empirically, or have a whole team working on it full-time. As to our theoretical speculations, (1) they're probably wrong anyway and (2) nobody will probably read them anyway, so there's no harm in posting them.
We should do our best to improve our understanding of our situation and our ability to engage in up-to-date alignment research, and public discussion is most conducive to that.
The counterarguments would go:
- If a theoretical speculation is posted because you assume it's either wrong or won't change anyone's mind, there's no reason to post it. The only worlds in which it's worth posting are those where the speculation is correct and will end up sending ripples. So you should be actively thinking about whether the tradeoff is worth it in any given case.
- Also, LLMs' context windows and internet-research abilities grow day by day. Are you quite sure your speculation won't be dumped into the 1B-token context window of a 2027 LLM that's tasked with searching for good overlooked capability ideas and is very good at needle-in-a-haystack retrieval?
There are strong reasons to expect capability researchers at AGI labs to perform below what you'd intuitively expect, creativity/innovation-wise:
- Large organizations only have so much attention to spare, only so many lines of inquiry they can pursue at full power.
- Large organizations tend to develop an internal culture, and so fall into groupthink-like failure modes. There are viral internal memes there, and a collective sense of promising/unpromising, popular/unpopular, high-status/low-status research directions.
- Research fields as a whole, similarly, are often dominated by memetic currents and what's popular/publishable, rather than what'd be actually productive.
- Experts in a given area often develop tunnel vision, especially if their expertise is narrow.
Research orgs often recognize those problems and attempt to fix them (e. g., OpenAI reportedly does), and obviously not all researchers fall into those failure modes. But those factors are still real.
"Memetic currents" specifically, I expect, have a very strong effect. What gets talked about is what capability researchers are often prompted to think about, and e. g. OpenAI apparently "pays a lot of attention to twitter". So if the AI Twitter decides that Problem A is the core reason LLMs aren't taking off yet, this might well draw the AGI labs' limited attention to that possibility. (Possible examples: "agency", "continual learning".)- In more detail: Suppose you post some claim, like "LLMs don't scale to AGI" or "long-term memory is what LLMs are missing", and suppose that claim is actually true. Those possibilities are of course something capability researchers are well aware of, but there are dozens of concerns like this, and only so many research directions to pursue at once. So any given capability researcher who happens to read your argument may not be focused on that. And your arguments probably won't persuade them, either. But if you're raising good points, they would ever-so-slightly nudge this researcher in the right direction. And given enough nudges like this...[1]
Consider the following chain of reasoning:
- To begin with, the people who are against accelerating AI capability progress are against it because they think the AGI labs are making a mistake.
- This implies that there's some belief about AI where we are correct and the capability researchers are, collectively, wrong.
- This implies somewhat different, or significantly different, models of AI/minds/agents.
- We, of course, believe that our model is the correct one, which suggests that the capability researchers' flawed models would hamper their capability progress as well.
(I. e.: we should expect a failure to understand the AGI doom to correlate with other reasoning errors about AGI, which would in-expectation hamper capability progress as well.)
So I think the worry about posting capability-advancing exfohazards is legitimate and worth keeping in mind.
Which, again, raises the question: what's the optimal amount of talking about this? What are useful rules of thumb here? Ideas welcome.
There are edge cases where posting something is obviously net negative, e. g., publishing your detailed AGI blueprint in order to win some argument. But what's the most concrete thing on the topic which is okay to post? Like, as I'd mentioned, even "I don't think LLMs scale to AGI" is arguably stupid to say, because what if it's true and you argue it so convincingly that even capability researchers are persuaded?[2]
[1] And there's of course some reason to believe this nudging process would point in the direction of the truth (though it might also be biased by memetic currents).
[2] And yes, that concern was already on my mind when I made that post, but I decided it's probably okay in that case, and also barely anyone will read it anyway, right?
I have, ahem, mixed feelings about it ending up with ~400 karma.
