LessWrong · September 27
On the Potential and Challenges of Research Automation in AI

 

This article examines the feasibility of research automation in AI and its implications for both capabilities and safety research. The author argues that although verifying the value of AI safety research and achieving conceptual breakthroughs pose real challenges, these difficulties are not unique to AI safety; they are common across other fields as well. The article contends that AI labs have an incentive to solve these problems in order to train automated researchers, which could in turn accelerate progress in AI safety research. It also notes that AI capabilities research faces its own evaluation pitfalls, and it explores possible approaches to training automated researchers, emphasizing the importance of generalization, conceptual thinking, and research taste. The conclusion: automating research is not impossible, but many obstacles remain.

💡 **The general challenges of research automation and what is (and isn't) unique to AI safety**: In any research field, evaluating the value of work (its scientific and practical impact) is extremely hard, and many fields likewise depend on conceptual breakthroughs. AI safety research is harder to verify only at the level of its ultimate goal; judging technical correctness, expected follow-up work, and the usefulness of tools within a given sub-field's assumptions is no harder than elsewhere. This suggests that evaluation difficulty does not make automating AI safety research impossible.

🚀 **Prospects for automating capabilities and safety research**: The author argues that AI labs have a strong incentive to solve the problem of training automated researchers, because doing so would bring enormous gains. Although capabilities research looks like an easy-to-evaluate "line goes up," it too is vulnerable to misleading evaluation, such as test-set contamination or overfitting. The potential for automating safety research lies in the fact that, even if we cannot directly evaluate ultimate "safety," we can still train models on other verifiable intermediate steps, much as in other fields.

🧠 **Automating conceptual thinking and research taste**: The article stresses that truly impactful research is usually bottlenecked on asking the right questions and finding the right angle, not on technical execution alone. AI safety may depend on conceptual innovation more than most fields, but the need for "research taste" and "conceptual thinking" exists across all research. If AI companies want to automate research, they must figure out how to train models with these abilities, and whatever they build should apply to automating AI safety research as well.

🛠️ **Possible paths to training automated researchers**: The article discusses training approaches such as RLHF (reinforcement learning from human feedback) and outcome-based RL. The author argues that RLHF alone tends to produce "slop" (outputs that look good but aren't useful), while outcome-based RL is hard to apply directly to research because the feedback loops are so long. A more promising approach may be to stack several training procedures to shape the AI's cognition, so that it checks its own work, considers alternatives, makes predictions, and examines its assumptions, yielding more general research ability that applies equally to AI safety.

Published on September 27, 2025 12:36 AM GMT

More people should probably be thinking about research automation. If automating research is feasible prior to creating ASI, it could totally change the playing field, vastly accelerating the pace of progress and likely differentially accelerating certain areas of research over others. There's a big risk, though, that AI capabilities research might be much easier to automate than safety research. One reason this could be the case is that it's much harder to verify that safety research is actually valuable, since we can't safely try out our techniques on an ASI. A second reason is that alignment research might be more strongly bottlenecked on conceptual breakthroughs than capabilities research, and getting such breakthroughs out of an AI seems likely to be significantly harder than automating the "normal science" needed to advance AI capabilities.

I want to make a fairly narrow argument, that AI safety research isn't so drastically different from research in other fields: other fields also have difficulty verifying the value of their work, and many other fields are also bottlenecked on conceptual breakthroughs. These difficulties may be more extreme in AI safety than in some other fields, but they're almost always present, even in ML. Because of this, I expect the big AI labs to put considerable effort into figuring out how to train automated researchers despite these difficulties, raising the chances that we'll be able to automate significant amounts of safety research. I wouldn't say that this makes me hopeful, exactly: the AI labs could very well fail to solve these problems effectively before creating ASI. But it does make me slightly more hopeful.

How do we evaluate research?

I think generally there are two different axes along which we evaluate the impact of research: impact on other researchers, and impact on broader society. I'll call these "scientific" and "practical" impact, respectively. Scientific impact is essentially a measure of how much other impactful research you enable. Papers with no scientific impact don't get cited; papers with some impact might enable a few follow-up papers; more impactful papers might introduce new approaches, enable us to answer novel questions, or open up whole new sub-fields of research. The foundational papers of string theory might be examples of research with high scientific impact but no practical impact; a clinical trial for a scientifically well-understood drug might be an example of research with a large practical impact but minimal scientific impact. In practice, of course, most research will have a mix of the two, and they might be hard to disentangle.

I think accurately evaluating research along either of these axes is just really hard in general. To evaluate the scientific value of a paper we rely on the judgment of experienced researchers in the short term, and in the long term we just wait and see what subsequent research is produced. The former is fallible, while the latter takes a very long time. The process is also at risk of circularity, where communities of researchers nerd-snipe each other into increasingly esoteric and useless rabbit-holes.[1] Evaluating the practical impact of some research is often easier, but it can still be fraught. Unless you're working directly on applications, most research just takes a long time to reach the point where it has any practical use. And even when it does, there are plenty of examples of applied, empirical research that ends up having nowhere near the impact people originally expect (or even having a negative impact), because the world is complicated and you are not measuring what you think you are measuring.[2] Along both axes, the only really foolproof way to judge the value of some work is to just wait a while and see what comes of it.

In AI safety

In AI safety, it's only the end goal -- whether we can safely build ASI -- that's extremely hard to verify. Every other level of impact, along both axes, is not any harder to evaluate than in other fields. It's certainly no harder to judge a paper purely on the level of technical correctness. We can also judge with no more than the usual amount of difficulty how much follow-up work a paper might be expected to produce. On the practical side, it's not any harder than in other fields to tell whether a practical technique or tool is useful according to the assumptions and threat model of a particular sub-field. It's only the very last step, judging whether those assumptions and threat models are valid and whether the research will actually move the needle on safety, that's more difficult than in other fields. For instance, we can see that the AI Control paper was impactful because it spawned a new sub-field of research, while still being uncertain about whether it'll actually help reduce the probability of catastrophe.

Now, you can certainly complain that it doesn't matter how "impactful" something looks if it doesn't actually help us survive. But I think this doesn't really change the prospects of automating research. If we want to train a model to do good research, we can't judge the outputs of the model on the basis of its long-term impact: we'd need shorter feedback loops for training. We won't be able to stick "does this solve alignment" in an objective function, but we also can't stick "does this cure Alzheimer's" (for instance) in an objective function, because the timescale over which we'd have to judge that is just too long. So at least when it comes to training an automated researcher, AI safety doesn't seem much worse off to me than any other field.

It's certainly possible that we'll end up following our automated researchers down useless rabbit-holes because it's too hard to judge whether something actually helps with safety. But many other fields have this problem too, if perhaps not as severe as in AI safety, because the feedback loops are so long. (It's probably even worse in pure math: the field doesn't even have an end goal to work towards, and I don't think they pay much attention to what the applied mathematicians do with their results.) And also, the same danger applies to human AI safety researchers. If your position is that AI safety is almost impossible to evaluate in principle, and that we shouldn't build an ASI until we have a full mathematical proof of its safety -- fair enough, but that's not an argument against research automation in particular.

AI capabilities

That's all well and good, but maybe we will have the short-term feedback loops we need to train automated AI capabilities researchers? After all, capabilities research is just "line go up," which should be easy to evaluate.

Well, yes and no: the important question is not "did line go up" but "which line went up."[3] Even in ML, it's very easy to fool yourself and others (at least for a time) about the quality of your work: you might accidentally or "accidentally" train on the test set, overfit to a dataset, or pick a benchmark that makes your method look good but doesn't translate into useful performance in practice. We've seen these failure modes again and again with LLMs, for instance with ChatGPT's sycophancy, or with Llama 4 and Chatbot Arena. The only ultimate measure of success for a model or new technique is whether other people use it or build on it in the long run. Evaluating work might be easier in ML than in most other fields, because the feedback loops are tighter, and you can fairly easily test something out on a range of different benchmarks to get a more holistic sense of how it performs. But it's still difficult; so if the AI labs want to automate capabilities research and aren't content with just making better coding agents, they'll have to address this issue.

Conceptual research

A somewhat different argument I've heard about the difficulty of automating safety research is that AI safety is strongly bottlenecked on good conceptual thinking or taste, and that this will be hard to automate due to the long feedback loops needed. I think AI safety might be more bottlenecked on conceptual innovations than most other fields, but it's certainly not unique. Neuroscience, for instance, is often described as being "data rich but theory poor," with many calls for better theoretical frameworks to handle the "sea of data" being collected.

But more generally, I don't think good conceptual thinking is really confined to certain fields, or to certain types of research. Regardless of the field, I think in most cases the limiting factor to doing really impactful research is not in the technical work -- writing proofs or code, running experiments, etc. -- it's in coming up with the right questions to ask, or the right angle to approach a problem.[4] Plenty of research gets published that, while technically correct and maybe even technically impressive, is just generally useless and unimpactful (try asking most PhDs about their thesis). This includes ML: you need good research taste to pick the right metric to optimize. So I think the kind of skill you need to come up with really paradigm-shifting ideas is pretty contiguous with the kind of skill you need to do good research at any level, in any field: it's mostly a difference of degree, not kind.

If models don't improve much at ideation, research taste, and good conceptual thinking, they do seem likely to accelerate capabilities research somewhat more than safety. But even if the AI companies don't care about automating AI safety, they'll still have an incentive to solve these problems, because they show up in many domains. And I think there's a good chance that whatever techniques they come up with will let us automate safety research too.

Implications for automating research

What might research automation look like in practice? How might the AI labs try to train automated researchers?

One possibility would be something like RLHF: train a reward model on human judgments of research quality, then train a model to maximize the score it gets from the reward model. This probably doesn't go particularly well, in any field: you'll get outputs that might look good to the reward model (and maybe to you) but don't actually end up being useful, i.e. slop. (But again, I don't see it being any worse in AI safety than in other fields.)
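To make that two-stage setup concrete, here is a minimal, hypothetical sketch in PyTorch: a small reward model fit to stand-in human quality judgments, then a candidate output nudged to maximize the learned score. All names (ResearchRewardModel, train_reward_model, and so on) are invented for illustration, the feature vectors stand in for embedded research outputs, and directly ascending the reward model's gradient stands in for the policy-optimization stage; this is not the author's proposal or any lab's actual pipeline.

```python
# Toy RLHF-style setup: (1) fit a reward model to human judgments,
# (2) optimize a candidate against the learned reward. Hypothetical names.
import torch
import torch.nn as nn

class ResearchRewardModel(nn.Module):
    """Maps a feature vector describing a research output to a scalar score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

def train_reward_model(rm, features, human_scores, steps=200, lr=1e-2):
    """Stage 1: fit the reward model to human judgments of research quality."""
    opt = torch.optim.Adam(rm.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(rm(features), human_scores)
        loss.backward()
        opt.step()
    return rm

def ascend_learned_reward(candidate, rm, lr=1e-1):
    """Stage 2 (toy): nudge the candidate output to maximize the learned reward.
    This is where slop/Goodharting creeps in: the reward model's optimum
    need not correspond to genuinely useful research."""
    candidate = candidate.clone().requires_grad_(True)
    rm(candidate).mean().backward()
    with torch.no_grad():
        return candidate + lr * candidate.grad  # gradient ascent on the proxy

# Usage: fit on 100 labeled outputs, then optimize a new candidate against it.
rm = ResearchRewardModel()
features = torch.randn(100, 16)   # stand-in for embedded research outputs
human_scores = torch.randn(100)   # stand-in for human quality judgments
rm = train_reward_model(rm, features, human_scores)
improved = ascend_learned_reward(torch.randn(1, 16), rm)
```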

What about outcome-based RL? We currently use this to train models on tasks we can easily verify, but doing this for entire research projects seems very hard: for almost all research, the time from ideation to actual real-world impact is way too long to package in an RL loop. You can't wait for a drug to go through clinical trials before applying a gradient update, for instance. And even in ML, you can't wait around for the next frontier training run to check whether the techniques you've come up with actually improve model quality in practice: you need to use more easily verifiable proxies of quality, and that introduces the risk of Goodharting on those proxies.
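As a cartoon of that proxy problem, consider the toy loop below, written under the assumption that the proxy's optimum differs from the true outcome's: an update rule that only ever sees the proxy will happily converge to the proxy's optimum. The functions (proxy_reward, true_outcome, hill_climb_on_proxy) are made up purely to illustrate the shape of the failure.

```python
# Toy illustration of outcome-based RL on a verifiable proxy. The "true"
# long-horizon outcome is unavailable during training, so all updates are
# driven by a cheap proxy check. Names and numbers are hypothetical.
import random

def proxy_reward(x: float) -> float:
    """Cheap, immediately verifiable proxy (e.g. a benchmark score)."""
    return -(x - 3.0) ** 2          # proxy optimum at 3.0

def true_outcome(x: float) -> float:
    """Slow, real-world outcome we cannot put inside the training loop."""
    return -(x - 2.0) ** 2          # true optimum at 2.0: the proxy is misaligned

def hill_climb_on_proxy(steps: int = 1000, step_size: float = 0.1) -> float:
    """Stand-in for a policy-update loop driven only by the proxy reward."""
    x = 0.0
    for _ in range(steps):
        candidate = x + random.uniform(-step_size, step_size)
        if proxy_reward(candidate) > proxy_reward(x):
            x = candidate
    return x

x = hill_climb_on_proxy()
print(f"proxy-optimal policy: {x:.2f}, true outcome there: {true_outcome(x):.2f}")
# The loop lands near 3.0 even though the real objective peaks at 2.0:
# Goodharting on the verifiable proxy.
```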

If the best we can do is outcome-based RL in some narrow domains where we can easily check output quality, I don't think we'll get very useful autonomous AI researchers. I expect humans would still largely guide the research process, telling AI assistants what questions to try to answer or what metrics to optimize. This would probably accelerate AI capabilities somewhat more than safety, but even ignoring safety, we'd be leaving a lot of value on the table. So I think there would be a strong incentive to go beyond this.

I think it's likely possible to build dangerous ASI while still only using this sort of simple outcome-based RL. AIs might just keep getting better at optimizing narrow metrics until one of them realizes it could optimize metrics better by tiling the galaxy with GPUs, or something. But I think if the AIs only get practice optimizing narrow metrics and don't have training for research taste or other skills needed to initiate and guide a research project, there's a decent chance this raises the bar. In other words, it seems plausible that, in order to get an AI capable of taking over the world without experience doing the kinds of thinking needed for really good research, you'd need a significantly bigger model than you would if it had such experience.

What might a solution to building good automated researchers actually look like? I don't have a concrete answer, and if I did I'm not sure I'd be writing about it here! But I want to make the case that it's probably not impossible, despite the fact that we lack a good objective function to optimize.[5] This is mainly because reward is not the optimization target. To train an AI to do good research, we shouldn't necessarily imagine trying to design a function that takes in a research output like a paper and has a global maximum at the best possible paper, and then optimizing that, and despairing because we don't have a non-Goodhartable way of judging research outputs. Rather, we should imagine trying to shape the AI's cognition to perform mental motions that are useful for research, and not perform detrimental ones: to check its work, consider alternate approaches, make predictions, examine its assumptions; to not just try to find the "right answer" or focus on passing the test. It's a problem of generalization: using some limited environments and datasets to instill behaviors that will generalize to doing good research in areas we can't easily judge. Broadly speaking, the fact that we can sometimes do this kind of generalization is why ML is useful at all. For instance, we don't really have a single reward function for "have good conversations with people," we have a weird combination: pre-training plus SFT plus RLHF, but prevent the RL from getting too far from the pre-trained prior because otherwise it goes off the rails, etc. Obviously this has its problems, but it works well enough most of the time. So maybe, if we do get good automated researchers, they'll be trained in a similarly convoluted manner, stacking together a couple different training procedures to get something that works well enough. Again, I don't think this is easy, but it doesn't seem unsolvable.
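As one purely illustrative example of stacking objectives, here is a sketch of the KL-regularized reward objective that the "don't let RL get too far from the pre-trained prior" clause gestures at: maximize a learned reward while penalizing divergence from a reference policy. The toy distribution over five discrete actions stands in for an LLM policy; the function names, reward values, and beta are assumptions, not anyone's actual training setup.

```python
# Sketch of a KL-regularized objective: expected reward under the policy
# minus beta * KL(policy || pre-trained reference). Toy discrete policy.
import torch

def kl_regularized_objective(policy_logits, ref_logits, rewards, beta=0.5):
    """Expected reward minus a KL penalty that anchors the policy to the prior."""
    policy = torch.softmax(policy_logits, dim=-1)
    log_policy = torch.log_softmax(policy_logits, dim=-1)
    log_ref = torch.log_softmax(ref_logits, dim=-1)
    expected_reward = (policy * rewards).sum()
    kl = (policy * (log_policy - log_ref)).sum()
    return expected_reward - beta * kl

# Usage: ascend the regularized objective over a 5-action toy policy.
ref_logits = torch.zeros(5)                        # pre-trained prior (uniform here)
policy_logits = torch.zeros(5, requires_grad=True)
rewards = torch.tensor([0.1, 0.2, 0.9, 0.2, 0.1])  # learned/proxy reward per action
opt = torch.optim.Adam([policy_logits], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = -kl_regularized_objective(policy_logits, ref_logits, rewards)
    loss.backward()
    opt.step()
# With this beta the policy shifts toward the high-reward action but keeps
# noticeable mass on the others; shrinking beta lets it collapse onto the
# reward's argmax, i.e. drift far from the prior and go "off the rails."
```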

Conclusion

I've argued that some of the reasons people have for thinking AI safety might be particularly hard to automate aren't actually so unique to AI safety. It's difficult to evaluate impact in any field, and good conceptual thinking and taste are needed for almost all research. So to unlock most of the value of automating research, AI companies will have to find ways of training automated researchers despite these difficulties. There's no guarantee that they'll solve them in time: it's certainly possible that simple scaling (or good old-fashioned human research) gets us to ASI before we figure out how to get useful research out of weaker AIs. But I don't think the problems are impossible in principle, and I expect the AI labs will have a pretty strong incentive to solve them. This makes me somewhat less worried about not getting a chance to automate significant amounts of safety research before ASI, or about the gap between automated capabilities research and automated safety research growing too large. I'm still pretty worried -- there's a lot of uncertainty about how things might play out. But, somewhat less worried than I would be otherwise.

  1. ^

    My impression is that some people think string theory is an example of this. I don't know enough physics to have an opinion on the matter.

  2. ^

    Leaded gasoline, CFCs, ivermectin, and all the non-replicable work in psychology are some examples.

  3. ^

    You'll notice in the diagram above that both the x-axis and the y-axis are labeled with "layers," making this a prime example of optimizing the wrong metric.

  4. ^

    This isn't to downplay the importance of empiricism, good execution, and generally making contact with reality: it's often in the course of running experiments or tinkering with a problem that we come up with new ideas, and it's often hard to judge the value of an idea until we try to implement it.

  5. ^

    The fact that we can train human researchers without waiting for them to write and get feedback on hundreds of different papers shows that this is possible in principle, although the unreliability of producing good human researchers does point to its difficulty.


