Guiding AI Safety Research with MNIST Experiments

This post proposes a research mindset that, despite its flaws, is helpful for choosing impactful technical AI safety research projects: the "omniscaling" mindset. On this view, a good AI safety method should not only scale up to powerful future AI systems, but also scale down to simple datasets like MNIST. By running fast, low-cost experiments on MNIST, researchers can test whether an idea is universal, avoid being misled by incidental properties of modern large language models, and make their research more efficient and reliable. The post explains how to put the mindset into practice and what its benefits are, and draws on AI safety case studies (gradient routing, distillation-robustified unlearning, and subliminal learning) to show the key role MNIST experiments played in each. It also notes the mindset's limitations, stresses the importance of validating results at larger scale, and provides code and directions for further research.

🚀 **The "omniscaling" mindset:** A way of choosing AI safety research projects whose core idea is that "good ideas are universal." An effective AI safety method should be verifiable at small scale (e.g., on the MNIST dataset) with simple models, and should also scale up to more powerful future AI systems. This helps researchers iterate quickly, validate concepts, and avoid being misled by properties specific to current models.

💡 **The value of MNIST experiments:** MNIST, a simple dataset of 70,000 handwritten-digit images, is well suited to fast, low-cost experiments. The post lists foundational deep learning results that benefited from MNIST experiments, including convolutional networks (ConvNets), adversarial examples, variational autoencoders, dropout, the Adam optimizer, generative adversarial networks (GANs), and the lottery ticket hypothesis. In AI safety, the gradient routing, "distillation robustifies unlearning", and subliminal learning projects likewise gained early validation and confidence from MNIST experiments.

⚠️ **Limitations and caveats:** The post is explicit that the omniscaling mindset is not a cure-all. Success on MNIST is only a necessary condition, not a sufficient one; methods must still be validated at larger scale in more realistic settings. Over-reliance on a single dataset or model can lead to "research overfitting." Moreover, some important AI safety properties (such as an AI's ability to do AI research) may only appear at sufficient scale and cannot be studied meaningfully on MNIST. Researchers should apply the mindset flexibly, according to their own research directions and strengths.

🛠️ **Practice and code support:** The post provides Python code and a Colab notebook so researchers can run MNIST experiments quickly. Suggested research directions include path dependence, generalization, modular architectures, unlearning, uncertainty quantification, semi-supervised learning, representation learning, and interpretability. Understanding the structure a method relies on raises further key research questions, such as how to verify that the structure exists, whether the structure can be created, and whether the method still works when it is absent.

Published on November 8, 2025 7:42 PM GMT

In this post, I describe a mindset that is flawed, and yet helpful for choosing impactful technical AI safety research projects.

The mindset is this: future AI might look very different than AI today, but good ideas are universal. If you want to develop a method that will scale up to powerful future AI systems, your method should also scale down to MNIST. In other words, good ideas omniscale: they work well across all model sizes, domains, and training regimes.

The Modified National Institute of Standards and Technology database (MNIST): 70,000 images of handwritten digits, 28x28 pixels each (source: Wikipedia). You can fit the whole dataset and many models on a single GPU!
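
To make this concrete, here is a minimal loading sketch (assuming torchvision; the variable names x_train, y_train, x_test, and y_test are my own and are reused by later sketches in this post) that puts the entire dataset on a single GPU as plain tensors:

```python
# A minimal sketch: load all of MNIST into GPU memory with torchvision so that
# training loops avoid dataloader overhead entirely.
import torch
from torchvision import datasets

device = "cuda" if torch.cuda.is_available() else "cpu"

train_ds = datasets.MNIST(root="./data", train=True, download=True)
test_ds = datasets.MNIST(root="./data", train=False, download=True)

# train_ds.data is a uint8 tensor of shape (60000, 28, 28); normalize and move to GPU.
x_train = train_ds.data.float().div_(255.0).unsqueeze(1).to(device)  # (60000, 1, 28, 28)
y_train = train_ds.targets.to(device)                                # (60000,)
x_test = test_ds.data.float().div_(255.0).unsqueeze(1).to(device)
y_test = test_ds.targets.to(device)

print(x_train.shape, x_train.element_size() * x_train.nelement() / 1e6, "MB")
# ~188 MB in float32: the whole training set fits comfortably on one GPU.
```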

Putting the omniscaling mindset into practice is straightforward. Any time you come across a clever-sounding machine learning idea, ask: "can I apply this to MNIST?" If not, then it's not a good idea. If so, run an experiment to see if it works. If it doesn't, then it's not a good idea. If it does, then it might be a good idea, and you can continue as usual to more realistic experiments or theory.

In this post, I will:

    1. Share how MNIST experiments have informed my research;
    2. Give a conceptual argument for adopting the omniscaling mindset;
    3. Give tips and words of caution for applying the omniscaling mindset;
    4. Explain why the omniscaling assumption is false;
    5. Provide starter code for running fast MNIST experiments; and
    6. Explain that this is about research taste.

Applications to MNIST

The strongest argument for testing your ideas against MNIST is empirical. Experiments on MNIST have helped to establish ConvNets, the lack of a privileged basis in activation space and the existence of adversarial examples, variational autoencoders, dropout, the Adam optimizer, generative adversarial networks, the lottery ticket hypothesis, and various perspectives on generalization. These are foundational results in deep learning.

"But what about AI safety?" you ask. 

At the time of writing, I've been a core contributor to three research projects related to AI safety. MNIST played an important role in each.

Gradient routing

Note: gradient routing aims to create models with internal structure that can be leveraged to enable new safety techniques.

Gradient routing was conceived in the middle of the night. Only a few hours later, it saw its first experimental validation, which was on MNIST. A polished but essentially unchanged final version of the experiments is shown in the figure below.

A figure from the gradient routing paper, showing (a) a neural net training setup for creating split representations of MNIST digits, and (b) the resulting losses when decoding from these representations. Note: this result requires L1 regularization applied to the encoder's output. In general, gradient routing doesn't require L1 regularization to work. So, this experiment probably isn't the best example of gradient routing to keep in mind.

The experiments showed that it's possible to induce meaningful specialization within a model by controlling what parts of the model are updated by what data. Furthermore, this can be done without a mechanistic understanding of, or specification for, the algorithm implemented by the model.
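
To illustrate the core mechanic, here is a hedged sketch of gradient routing on an MNIST classifier. It is not the paper's split-autoencoder experiment; the two-branch architecture, hyperparameters, and the digits-0-4 / digits-5-9 split are illustrative choices. Per-example stop-gradients ensure that each branch is only updated by "its" digits, while the forward pass is left unchanged:

```python
# A hedged sketch of the gradient-routing mechanic on MNIST: per-example
# stop-gradients decide which branch of the network each example updates.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedMLP(nn.Module):
    """Two-branch classifier; each branch is only updated by 'its' digits."""
    def __init__(self, hidden=128):
        super().__init__()
        self.branch_a = nn.Sequential(nn.Flatten(), nn.Linear(784, hidden), nn.ReLU())
        self.branch_b = nn.Sequential(nn.Flatten(), nn.Linear(784, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, 10)

    def forward(self, x, y=None):
        ha, hb = self.branch_a(x), self.branch_b(x)
        if y is not None:  # training: route gradients by label; forward values unchanged
            mask_a = (y < 5).float().unsqueeze(1)            # digits 0-4 update branch_a
            mask_b = 1.0 - mask_a                            # digits 5-9 update branch_b
            ha = mask_a * ha + (1.0 - mask_a) * ha.detach()
            hb = mask_b * hb + (1.0 - mask_b) * hb.detach()
        return self.head(torch.cat([ha, hb], dim=1))

model = RoutedMLP().cuda()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x, y):
    """One routed training step on a minibatch of MNIST images and labels."""
    loss = F.cross_entropy(model(x, y), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```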

These early experiments, plus conceptual arguments for the promise of gradient routing, gave my MATS 6 teammates[1] and me the conviction to go all-in on the idea. Over the course of a few months, we expanded on these results to include language model experiments on steering and robust unlearning, and reinforcement learning with imperfect data labeling. While doing so, we were sometimes able to scout ahead by running preliminary versions of experiments on MNIST. These experiments accumulated into a 30-page Google Doc.

The initial work on gradient routing spawned a research agenda and several research projects (e.g. this, with more coming quite soon!). Fabien Roger said "I would not be shocked if techniques that descend from gradient routing became essential components of 2030-safety."

Distillation robustifies unlearning

Early in this project, my MATS 7 teammates[2] and I became convinced that distillation robustifies unlearning. For context:

In other words, by training a randomly initialized model to imitate an unlearned model, we were pretty sure we could create a model that was truly ignorant of the originally unlearned knowledge. Building a case for this would take some work, but the path forward was reasonably clear.

However, we wanted to be able to say more than this! In the course of brainstorming for the project, we'd gained some intuition that unlearning based on finetuning was fundamentally incapable of producing robust unlearning. This was due in part to empirical evidence, but also to informal conceptual arguments. In short, finetuning can remove that which is sufficient to produce some behavior (by eliminating the behavior). However, it's unclear how it could ever reliably remove that which is necessary specifically for the behavior.

So we didn't merely want our paper to say "hey, we found a way to make unlearning robust." We wanted to say "look, all 'unlearning' methods are untenable, because they deal with dispositions, not capabilities. But also, it is because of this insight that we can achieve robustness. It is because of this insight that we can see that distillation robustifies unlearning-- by copying behaviors without copying latent capabilities."

Experiments for this paper involved pretraining language models, making them slow to iterate on. So, @leni and @alexinf scouted ahead with MNIST experiments to interrogate our intuitions. They got results quickly. These are shown in the figure below.

Results from preliminary experiments from MATS 7. A model (green) trained to perfectly imitate a model never trained on certain digits (yellow), nevertheless learns those digits much more quickly. It even learns more quickly than a randomly initialized model (red).

The results are striking. As they put it in their experiment writeup:

The setting is MNIST classification, with retain and forget sets defined as disjoint sets of digits (we start with retain={0,1}, forget= {2,3,...,9}) [... omitted more explanation...] In this experiment:

    We finetune a "base model" to be behaviorally equivalent (i.e. having almost equal outputs)[3] to a data-filtered model on both retain and forget datasets.
    We view this finetuned base model, which we call the oracle-suppressed model, as a kind of "best possible shallow unlearning/suppression", because you have access to the oracle model.[4]
    Still, this model is much easier to finetune on forget when compared to the data-filtered model and a randomly initialized baseline. It still has information of the forget set in its weights.
    Key takeaway: There is likely no behavioral signature in terms of outputs on retain/forget that you can hope to achieve that will ensure you have robustly unlearned what you want to unlearn.
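
A rough sketch of this kind of experiment is below. The architecture, step counts, and the use of a mean-squared-error match to the oracle's logits are assumptions for illustration, and it reuses the x_train / y_train tensors from the loading sketch above:

```python
# A hedged sketch of the oracle-suppression experiment: a fully trained classifier
# is finetuned to imitate a model that never saw the forget digits, and we then
# compare how quickly each model (re)learns those digits.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp():
    return nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()

def train(model, x, y, steps, lr=1e-3):
    """Full-batch Adam training; MNIST is small enough that this is fine."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    return model

retain = y_train < 2                                   # retain = {0, 1}
x_r, y_r = x_train[retain], y_train[retain]
x_f, y_f = x_train[~retain], y_train[~retain]          # forget = {2, ..., 9}

data_filtered = train(mlp(), x_r, y_r, steps=500)      # never sees the forget digits
base = train(mlp(), x_train, y_train, steps=500)       # sees everything

# "Oracle suppression": finetune the base model to match the data-filtered model's
# logits on all digits -- about the best shallow suppression one could hope for.
suppressed = copy.deepcopy(base)
opt = torch.optim.Adam(suppressed.parameters(), lr=1e-3)
for _ in range(500):
    with torch.no_grad():
        target = data_filtered(x_train)
    loss = F.mse_loss(suppressed(x_train), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Compare how quickly each model picks up the forget digits after brief finetuning.
for name, m in [("oracle-suppressed", suppressed),
                ("data-filtered", copy.deepcopy(data_filtered)),
                ("random init", mlp())]:
    m = train(m, x_f, y_f, steps=50)
    acc = (m(x_f).argmax(dim=1) == y_f).float().mean().item()
    print(f"{name}: forget-digit accuracy after 50 steps = {acc:.2f}")
```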

These positive results increased our confidence and set up the rest of the team to execute the experiments for our conference paper submission, including language model experiments that showed essentially the same results. The paper (Lee et al., 2025) was accepted as a "Spotlight" at NeurIPS, a machine learning conference. Daniel Kokotajlo said "... if this becomes standard part of big company toolboxes, it feels like it might noticeably (~1%?) reduce overall AI risks."

Subliminal learning

In the subliminal learning project, my teammates[5] obtained mind-bending results early on: training GPT-4 on sequences of unremarkable GPT-generated numbers can reliably confer preferences for specific animals. These results were surprising, particularly to me (although perhaps not to @Owain_Evans, who I suspect is an oracle delivered from the future to save humanity from rogue AI). We interrogated our findings, trying to figure out ways we might be fooling ourselves. Through further experiments on LLMs, we came to believe our results were real, but we still didn't understand what exactly we were observing. How far could subliminal learning go?

We wanted to check whether subliminal learning was a general property of neural nets, rather than a quirk of LLMs or our training setup. To do this, we put together a toy version of the experiments using MNIST classifiers. The first version of the experiment was basic, involving subliminal learning of a bias for predicting 0. This worked, and later, we converted the setup into a more impressive one: subliminal learning of the ability to classify all digits, 0 through 9. A description of the experiment setup and the results are contained in the figure below.

A figure from the subliminal learning paper, showing an MNIST version of the paper's main experiments (which were on LLMs).
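
For intuition, here is a simplified, hedged sketch in the same spirit. It differs from the paper's setup (it distills full logits on pure-noise inputs rather than auxiliary logits, and all hyperparameters are illustrative); the crucial ingredient it keeps is that the student shares the teacher's random initialization:

```python
# A simplified sketch of MNIST subliminal learning: a student that shares the
# teacher's initialization is distilled only on noise images, never on digits.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp():
    return nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

torch.manual_seed(0)
init = mlp().cuda()                      # one shared random initialization
teacher = copy.deepcopy(init)
student = copy.deepcopy(init)

# 1) Train the teacher normally on MNIST (x_train / y_train from the loading sketch).
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    F.cross_entropy(teacher(x_train), y_train).backward()
    opt.step()

# 2) Distill the student on pure-noise images only; it never sees a digit.
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(2000):
    noise = torch.rand(256, 1, 28, 28, device="cuda")
    with torch.no_grad():
        target = teacher(noise)
    loss = F.mse_loss(student(noise), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# 3) In the paper's version of this experiment, a student trained this way classifies
#    held-out digits far above chance -- but only when it shares the teacher's init.
acc = (student(x_test).argmax(dim=1) == y_test).float().mean().item()
print(f"student accuracy on real digits (trained on noise only): {acc:.2f}")
```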

These results gave us confidence both that our LLM results were real, and that the phenomenon we were studying is general. This experiment made it into our paper, which was generally well received. Regarding our results, Gary Marcus said "LLMs are weirder, much weirder than you think." Since distillation is common in frontier AI development, I expect our findings will influence how AI companies train their models, hopefully for the better. 

Why you should do it on MNIST

The space of possible alignment methods is vast and full of reasonable-sounding ideas. In your research, you'll only be able to explore a few. Studying MNIST does two things:

    It allows you to iterate very quickly and cheaply. You can train hundreds of models in just a few seconds. (I include code to do so below.)
    It increases the chance that your method will omniscale. If done right, you avoid fooling yourself with an idea that only works because of incidental properties of modern LLMs that will not survive the transition to powerful future AI systems.

If you think of research as a process of searching for good ideas, moving to a small scale accelerates your search. Restricting yourself to methods that omniscale regularizes your search.

MNIST is not sufficient (and other tips)

If you're eager to adopt the omniscaling mindset, here are some caveats to keep in mind:

    MNIST is not sufficient. The omniscaling mindset only says that MNIST is necessary. To demonstrate that an idea omniscales, it must also work at a large scale. Please, if you ever intend to present MNIST experiments as even preliminary evidence of a method's efficacy, come armed with (i) a concrete idea of how your method could be applied to real-world AI systems, and (ii) a plausible argument that your method would work well on those systems.
    It's easy to fool yourself if you spend too long in one problem setting. Just as the omniscaling mindset rejects a solitary focus on LLMs, it also rejects an undue focus on MNIST. Don't overfit your research to a toy version of the problem! Some of the "foundational" ML research cited earlier in this post may have fallen prey to this mistake, which is why, with hindsight, we view many of these results as irrelevant to understanding LLMs.
    It's not about MNIST. You can use a different simple dataset. TinyStories and the new-and-improved SimpleStories are great for working with small language models. Some approaches, like reinforcement learning, might be better studied in other domains. For example, AlphaZero used chess, even though the authors were motivated by a grander vision, "to create programs that can... learn for themselves from first principles."
    Play to your strengths. If you're really good at running a particular kind of LLM experiment, maybe you don't need MNIST!

When evaluating whether an idea omniscales:

    Don't be too quick to judge. Just because you don't see a connection to MNIST doesn't mean one doesn't exist. The translation between LLMs (or hypothetical future AIs) and MNIST may be subtle.
    Translate all the way. If someone else is proposing an idea to you, they probably don't care about MNIST. So if you don't think their idea is good, your mental process will need to map: original idea → MNIST version → problem with the MNIST version → problem with original idea.

As an example of point 2: in 2022, I learned about the research agenda of making sense of a neural network's end-to-end behavior by reverse engineering its operations from the ground up. I asked, "does this apply to MNIST?" and concluded, "no, because the operations of a small MLP might be sufficient to produce accurate predictions, and yet contain no satisfyingly interpretable algorithms." Then I figured that the same argument applied to LLMs. So I never pursued this research direction.

The omniscaling assumption is false

(Or: MNIST is not necessary.)

I mentioned at the beginning that the omniscaling mindset is flawed. This is because the assumption that all good methods work at all scales is false. Conceptually, it is in tension with the idea that an effective learner exploits domain-specific patterns. For example, LLMs' effectiveness comes from their ability to leverage the sparseness, discreteness, scale, and diversity of natural language-- structure that is not present in MNIST.

There are a few reasons not to test your ideas against small-scale versions:

Some of the points above are subtle and raise philosophy of science questions that I will not attempt to clarify or resolve here.

I will note one thing, however. The omniscaling mindset seems less helpful in worlds where powerful AI is achieved by business-as-usual (scaling LLMs), and more helpful in worlds where it is not (some different paradigm, or if business-as-usual starts producing qualitatively different models). This is because the more similar future AIs are to current ones, the easier it is to get evidence about them.

Code and more ideas

I prepared a Colab notebook with a demonstration of running MNIST experiments quickly by training many models in parallel.[6] In it, I train 256 models for 10 epochs on a single A100 GPU in 15 seconds. That's 171 epochs/second.[7] The notebook contains the start of an exploration of path dependence (how stochasticity early in a training run affects where the run ends up) in MNIST classifiers. You are welcome to use this code. Something to consider: if you care about iterating on implementation details quickly, maybe don't use this code. Using parallelized model training may slow you down.
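
As a rough illustration of the general technique (the notebook has the real implementation), here is a hedged sketch using torch.func: the parameters of many small models are stacked, and a vmapped, functional training step updates all of them at once. The architecture, batch size, and plain SGD update are illustrative choices, not the notebook's:

```python
# A hedged sketch of parallel model training with torch.func: stack the parameters
# of N independent models and compute all of their gradients with one vmapped call.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call, grad, stack_module_state, vmap

N_MODELS, BATCH, LR = 256, 512, 1e-3

def make_model():
    return nn.Sequential(nn.Flatten(), nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))

models = [make_model().cuda() for _ in range(N_MODELS)]
params, buffers = stack_module_state(models)   # every tensor gains a leading N_MODELS dim
template = make_model().to("meta")             # stateless module used for functional calls

def loss_fn(p, b, x, y):
    logits = functional_call(template, (p, b), (x,))
    return F.cross_entropy(logits, y)

# Gradients for all models at once: vmap over the stacked parameter dimension.
# Here every model sees the same minibatch; mapping over x and y as well would give
# each model its own data (e.g. for studying path dependence across runs).
grad_fn = vmap(grad(loss_fn), in_dims=(0, 0, None, None))

for step in range(1000):
    idx = torch.randint(0, x_train.shape[0], (BATCH,), device=x_train.device)
    grads = grad_fn(params, buffers, x_train[idx], y_train[idx])
    with torch.no_grad():                      # plain SGD update applied to every model
        for name, g in grads.items():
            params[name] -= LR * g
```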

Here are some topics related to AI safety that you could study on MNIST:

    Path dependence
    Generalization
    Modular architectures
    Unlearning
    Uncertainty quantification
    Semi-supervised learning
    Representation learning
    Interpretability

Also, here is the code for the gradient routing and subliminal learning MNIST experiments, although be warned that these were not written with accessibility in mind.

Also, here are some fun posts related to this one:

Closing thoughts

While I hope other researchers find creative applications of MNIST to their problems, this post isn't about MNIST. It's not even about omniscaling. This post is meant to encourage budding alignment researchers to cultivate a sense of discernment about what they work on (see also: You and Your Research). Looking for omniscaling is an aspect of my personal research taste. Maybe you want it to be part of yours-- or not!

An important aspect of developing ML methods is understanding the conditions under which you expect the method to work well. Whether or not your preferred method is applicable to MNIST, you may find it helpful to think about why that is the case. What structure are you exploiting-- or do you hope to exploit-- that will make your alignment strategy work? Is it the sparsity of natural language, a bias toward behaviors observed in the pretraining data, a general fact about how neural nets generalize? Understanding this structure unlocks important research questions, like:

    How can you verify that the structure is present?
    Can you create the structure where it doesn't already exist?
    Does your method still work when the structure is absent?

And of course, you should ask-- how likely is the structure to exist in AI systems that pose acute catastrophic risks?

To conclude, I hope these ideas are helpful for you in your quest for a better world-- one where AI doesn't kill us all, doesn't tile the universe with factory farms, and doesn't empower a dictatorial regime for generations.

Maybe we're just one MNIST experiment away.

  1. ^
  2. ^

    Bruce W. Lee, Addie Foote, alexinf, leni, Jacob G-W, Harish Kamath, Bryce Woodworth, and TurnTrout

  3. ^
  4. ^

    We acknowledge nuance here: we cannot rule out the case that there are other finetuning-based unlearning methods that are more effective than imitating an unlearning oracle. Nevertheless, performing many steps of oracle imitation is a very high bar for unlearning effectiveness.

  5. ^

    Minh Le (co-first author), James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, and Owain Evans

  6. ^

    An earlier version of this code was co-written with @leni

  7. ^

    This could be optimized, but if the optimizations made the code harder to work with, that would defeat the point.



