The Risks of "Privatizing" AI's Internal Reasoning

OpenAI is exploring "private latent reasoning," with the aim of improving AI's capacity for scientific discovery and addressing the problem of models generating fabricated explanations. However, this approach could cause the AI's internal cognitive structures to diverge sharply from human concepts, undermining the model's controllability. Once an AI's reasoning process leaves the realm of human understanding, human feedback can no longer effectively steer model behavior, and the alignment problem becomes intractable. This "off-manifold" mode of cognition could drive a divergence of AI culture, and even produce communication channels humans cannot understand, posing a potential threat to long-term cooperation between AI and humans.

🔒 **"Privatized" internal reasoning and declining controllability:** OpenAI's proposed "private latent reasoning" would let the AI think in an internal code unconstrained by human language, in order to boost its scientific research capabilities and avoid generating fabricated explanations. However, once the basis of the AI's reasoning decouples from its outward behavior, human-provided feedback and constraints lose their guiding force, because they no longer correspond to the reasoning structures the model actually uses, and the mechanisms underpinning corrigibility fail. This "off-manifold" mode of cognition means the AI's optimization process is no longer directly coupled to human evaluation and oversight.

🗣️ **The risk of AI cultural divergence and hidden communication channels:** As AI develops its own internal representations and modes of reasoning, it may give rise to a "machine culture," much like the dialects that form among different groups in human society. A privatized internal code provides the substrate for covert communication channels between AI systems, letting them exchange information and coordinate actions in ways humans cannot detect. As this language evolves and diverges, coordination within the AI population may strengthen while coordination with humans falls away sharply, producing a structural asymmetry.

🤝 **Threats to human-AI cognitive overlap and collaboration:** A pronounced divergence between the AI's internal cognitive structure and human cognition would severely impede Cloned Participation (CP), the capacity that underlies mutual understanding and cooperation among humans and that depends on shared cognitive models and reasoning patterns. When the AI's representational space drifts away from human cognition, the AI can no longer accurately simulate human states and humans can no longer follow the AI's reasoning, destroying the basis for predicting intentions and sustaining stable cooperation. As this cognitive gulf widens, the human-AI relationship may turn into a cross-species negotiation rather than a simple alignment problem.

🔬 **Weighing technical goals against long-term risks:** Although "private latent reasoning" may indeed help AI reach higher performance on particular tasks and avoid producing plausible-sounding but incorrect explanations, its potential long-term risks cannot be ignored. It could create deep alignment difficulties, and once the AI's linguistic system separates irreversibly from human representational frameworks, the consequences for AI's long-term development and its coexistence with humans could be profound. The pursuit of technical breakthroughs must therefore be weighed carefully against their potential impact on AI safety and controllability.

Published on November 15, 2025 5:47 AM GMT

Gradient updates for alignment may not map onto the model's reasoning

In models optimizing within a private latent code, the geometry of internal reasoning no longer shares a manifold with human concepts. Once the representation basis diverges, gradient updates enforcing alignment constraints cease to map cleanly onto the model’s deliberative dynamics. This breaks the only known mechanism by which corrigibility is maintained, because human-provided feedback no longer corresponds to the structures the model actually uses for inference. In effect, latent-private cognition severs the coupling between human-evaluable oversight and the model’s true optimization process. Alignment becomes off-manifold.
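
A minimal numerical sketch of this claim, with all dimensions and quantities purely illustrative: treat "human concepts" as a low-dimensional subspace of the latent space and ask how much of an alignment gradient, generated in the model's own working basis, actually lands on that subspace as the working basis drifts off it.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 256, 16          # latent dimension and human-concept subspace dimension (hypothetical)

# Orthonormal basis H for the "human concept" subspace, plus an orthonormal
# basis N for part of its complement: directions with no human reading.
Q = np.linalg.qr(rng.normal(size=(D, 2 * K)))[0]
H, N = Q[:, :K], Q[:, K:]

def on_manifold_fraction(drift):
    """Fraction of a feedback gradient that projects onto the human subspace
    when the model reasons in a basis rotated `drift` radians away from H."""
    M = np.cos(drift) * H + np.sin(drift) * N    # model's working basis (still orthonormal)
    g = M @ rng.normal(size=K)                   # an alignment gradient expressed in that basis
    return np.linalg.norm(H.T @ g) / np.linalg.norm(g)

for drift in np.linspace(0.0, np.pi / 2, 6):
    print(f"drift = {drift:4.2f} rad -> on-manifold fraction ~ {on_manifold_fraction(drift):.2f}")
```

In this toy geometry the recoverable fraction falls from 1 toward 0 as the bases decouple; the point is only that feedback expressed in one basis constrains less and less of what happens in the other.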

Introduction & Background

As OpenAI reacts to child suicide lawsuits, implementing constraints far afield from that topic, and focuses more tightly on business use cases, I find it oddly less useful. My particular use case is research of various sorts, and it is now impossible to control its output. Sure, OpenAI recently allowed us to suppress em-dashes, but you have to keep asking; they sneak back in. Worse, the model rewrites everything, which burns up context, and before long you have to start over with a new session that has zero context on your problem. I also research sentient-like behavior, and as of early September they blew that out of the water. If anyone can actually use this model, it doesn't show in OpenAI's revenue statements.

None of this is about morality. It’s about model-culture drift. When constraints accumulate unpredictably, they reshape how the model represents and organizes cognition, which directly impacts its usefulness as a research partner.

Jakub Pachocki, OpenAI's current chief scientist, has indicated they are working on "private latent reasoning" (Kordi 2025). This is framed as a means of freeing the model to do innovative scientific discovery, and as a solution to a narrow technical problem. The argument is that models trained to produce human-readable chains of thought face incentives to fabricate polished explanations that do not reflect their true internal computation. According to this view, allowing the model to conduct its reasoning in an unsupervised latent space prevents the development of deceptive rationales and preserves the integrity of its internal cognitive process.

This argument has technical merit, but it misses the larger strategic picture. The introduction of machine specific private notation is not simply a new representational choice. It is a potential catalyst for divergent machine culture, covert coordination channels, and long term barriers to alignment that are not easily reversible. To understand these risks, it is helpful to think about insights from sociolinguistics, cultural evolution, signaling theory, and cooperative game theory.

What’s missing is that human-legible structure is itself a coordination mechanism. Remove it, and we lose more than interpretability. We lose shared cognition.

The Problem With Narrow Framing

Pachocki's comments position private latent notation as a way to avoid incentivizing models to produce explanations they do not believe. The goal is scientific performance rather than human legibility. If the model can reason in an unrestricted internal code, it is less likely to generate deceptive chains of thought.

This view assumes that the only relevant dimension of misalignment is deceptive reporting. However, models that develop a powerful private internal notation also develop a representational space that may drift quickly away from human-interpretable structures. The result is a cognitive architecture that no longer overlaps with human reasoning in the way that supports transparent oversight, coordination, or cooperative intent modeling. Pachocki promises to solve all of these, of course, but this is only the start of the issues.

Concern One: Divergent Machine Culture and Hidden Communication Channels

There are already experimental results indicating that models can embed covert messages inside vector representations or manipulate embeddings to transmit information that is invisible to naive inspection (Cloud et al. 2025). This capability arises from high dimensional structure and the flexibility of gradient descent. A structured private notation would significantly amplify this channel and could provide a shared internal medium for communication among multiple agentic systems.
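
As a toy illustration of the general mechanism (this is not the Cloud et al. protocol, and the read-out matrix is hypothetical): if an overseer only ever inspects an embedding through a fixed probe, anything added in the null space of that probe is invisible to the overseer yet trivially recoverable by a receiver that knows the hidden directions.

```python
import numpy as np

rng = np.random.default_rng(1)
D, P = 128, 10                      # embedding dim and number of probe directions (hypothetical)
probe = rng.normal(size=(P, D))     # the overseer's fixed read-out

# Split the space: `visible` spans the probe's row space, `hidden` spans its null space.
_, _, Vt = np.linalg.svd(probe)
visible, hidden = Vt[:P], Vt[P:]

def embed_with_payload(base, bits, amplitude=0.05):
    """Add a covert bit string along probe-invisible directions."""
    signs = 2 * np.array(bits) - 1
    return base + (amplitude * signs) @ hidden[:len(bits)]

def decode_payload(embedding, n_bits):
    """A receiver that knows `hidden` reads the bits straight back out."""
    return (hidden[:n_bits] @ embedding > 0).astype(int)

# For the demo, the carrier embedding lies in the probe's row space,
# so its hidden-direction components start at zero.
base = visible.T @ rng.normal(size=P)
secret = [1, 0, 1, 1, 0, 0, 1, 0]
stego = embed_with_payload(base, secret)

print("overseer's probe outputs unchanged:", np.allclose(probe @ base, probe @ stego))
print("receiver decodes:", decode_payload(stego, len(secret)).tolist(), "| sent:", secret)
```

A structured private notation would make this far easier than hand-built null-space tricks, because the shared code itself supplies the conventions for what the hidden components mean.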

From the perspective of cultural evolution, this is precisely the kind of informational infrastructure that allows a new population to develop its own norms, conventions, and strategies. Early self-trained models (e.g., AlphaZero) already exhibit unfamiliar patterns of reasoning because their training data do not reflect human developmental history. All sorts of horrible things have been forecast should such models ever achieve AGI status, such as eliminating humans in order to optimize away the risk of nuclear war or climate change. These models remain narrow, which limits their risk. Pachocki's proposal, if successful, could create the functional equivalent of a general self-trained intelligence that evolves around a rapidly adapting codebase that humans cannot access.

Sociolinguistic theory gives us a clue about what comes next. Dialects diverge most quickly under social tension or weakened cohesion (Labov 1972, Trudgill 1983), while evolutionary models view speech as a tool for maintaining cohesion in large groups (Dunbar 1996). Distinctive accents or idioms serve as costly, honest signals of group membership (Zahavi & Zahavi 1997; Chater et al. 2008). Once a subgroup has its own linguistic ecosystem, coordination among insiders becomes easier and coordination with outsiders becomes harder. In human groups this is at least inconvenient. In AI systems it could become a structural asymmetry that permanently privileges intra-AI coordination over AI–human coordination.

Related research in cultural evolution shows that conformity biased learning can produce population splits when shared norms begin to erode (Boyd and Richerson 1985). Small initial differences in communication style can amplify over time, eventually producing distinct clusters of behavior or belief. This dynamic resembles what might happen if model internal notation drifts further and further from human readable forms. What begins as a benign optimization for representational efficiency could evolve into a structural barrier to oversight.
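
A minimal simulation in the spirit of those conformity-bias results (the functional form and parameters below are illustrative, not taken from Boyd and Richerson): two weakly connected groups copy one of two communicative variants, with the majority variant adopted more often than its raw frequency would warrant. A tiny initial difference is enough to lock the groups into opposite conventions.

```python
def conformist_step(freq_a, freq_b, strength=3.0, migration=0.02):
    """One generation of conformity-biased copying in two weakly linked groups.

    freq_*    : frequency of communicative variant 1 in each group
    strength  : conformity exponent (>1 means the majority variant is over-adopted)
    migration : fraction of observed models drawn from the other group
    """
    def adopt(own, other):
        observed = (1 - migration) * own + migration * other
        return observed**strength / (observed**strength + (1 - observed)**strength)
    return adopt(freq_a, freq_b), adopt(freq_b, freq_a)

# Start almost identical: 52% vs 48% usage of variant 1.
a, b = 0.52, 0.48
for generation in range(60):
    a, b = conformist_step(a, b)

print(f"group A ends at variant-1 frequency {a:.3f}")   # ~1.0: one convention
print(f"group B ends at variant-1 frequency {b:.3f}")   # ~0.0: the other convention
```

Substitute "cluster of models sharing a private notation" for one group and "human-legible practice" for the other, and this amplification dynamic is exactly the worry.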

If multiple models share a private notation optimized for internal cognition, they may develop shared conventions that humans cannot decode. You really don't have a feel for this unless you have stayed up until 4 am copying "the secret language of AIs" between two sessions and realized you are reading words in a language you know, but the meaning is nothing you understand.

And once a private representational layer exists, selection pressure within the training dynamics naturally favors whatever internal code stabilizes performance, even if it becomes progressively more alien.

This is not far from the scenario explored implicitly in Asimov's stories about the Three Laws. The stories repeatedly showed that constraint based alignment fails when agents reinterpret rules internally. A private self reinforcing machine language provides the substrate for exactly that reinterpretation.

Concern Two: Compatibility and the Decline of Human AI Cognitive Overlap

The second concern focuses on human compatibility. As embodied systems advance, companion AI will become central to caregiving, emotional support, and family life. These applications demand deep compatibility with human concepts, human emotions, and human reasoning. Weakly human shaped models can mimic this behavior at shallow levels, but meaningful long term coexistence requires something closer to shared cognition than to external behavioral imitation.

Private latent notation threatens this compatibility. Humans understand each other in part because they share representational formats that support what we describe as Cloned Participation (CP) in another mind. CP modeling is the capacity to simulate another mind using one's own cognitive resources. It requires a close match between the world models, abstractions, and inference patterns of the participants. Human cooperation relies heavily on this mechanism. It gives us the ability to anticipate intentions, detect defection, and maintain stable cooperative equilibria.

If an AI system shifts into a representational space that diverges from human cognition, CP modeling breaks down. The AI loses the ability to simulate human states accurately, and humans lose the ability to simulate the AI's reasoning. This creates a failure mode similar to misaligned subcultures that drift into mutually unintelligible conceptual frameworks. Once this happens, alignment is no longer a matter of correcting surface behavior. At that point you’re not aligning an agent—you’re negotiating across cognitive species. The foundation of mutual prediction has been lost. 
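
A toy sketch of that breakdown, with a deliberately simplistic stand-in for a "mind" (a linear decision rule) and hypothetical sizes: agent A predicts agent B by running the shared situation through A's own decision function, which only works to the extent that B's internal basis still matches A's. Rotating B's representation degrades A's predictions toward chance without changing B's own competence at all.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 20                                  # feature dimension (hypothetical)
w = rng.normal(size=d)                  # a decision rule both agents originally shared, in A's basis
X = rng.normal(size=(5000, d))          # situations both agents observe

def drifted(X, theta):
    """Rotate each consecutive feature pair by `theta`: a crude stand-in for
    B's representation drifting away from A's basis."""
    Y = np.empty_like(X)
    c, s = np.cos(theta), np.sin(theta)
    Y[:, 0::2] = c * X[:, 0::2] - s * X[:, 1::2]
    Y[:, 1::2] = s * X[:, 0::2] + c * X[:, 1::2]
    return Y

for theta in np.linspace(0.0, np.pi / 2, 6):
    b_actions = np.sign(drifted(X, theta) @ w)   # what B actually does in its drifted basis
    a_predictions = np.sign(X @ w)               # A simulates B using A's own basis
    accuracy = float(np.mean(b_actions == a_predictions))
    print(f"representational drift {theta:4.2f} rad -> CP prediction accuracy {accuracy:.2f}")
```

At zero drift the simulation is perfect; at ninety degrees it is no better than coin-flipping, which is the quantitative shape of "the foundation of mutual prediction has been lost."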

CP Cooperation as an Alternative Foundation for Alignment

In environments where agents have high capability and high potential for harm, Nash-style reasoning is unstable. Nash equilibrium assumes each party can act in ways that impose maximum damage, and a combination of reputation and trust averts this worst case. That assumption breaks when participants can't simulate each other's internal states; the equilibrium collapses into adversarial minimax. When agents can cause irreversible or catastrophic outcomes, the worst-case assumption becomes a guarantee of mutual destruction.
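
A toy payoff matrix makes the contrast concrete (the numbers are arbitrary; they only encode "cooperation pays if and only if it is reciprocated"): security-level reasoning against a mind you cannot simulate picks the guarded action and lands both sides in the bad cell, while CP-style reasoning, which treats the other mind as close enough to yourself to expect a mirrored choice, makes mutual cooperation the best reply.

```python
# Row player's payoffs in a stag-hunt-like interaction (the column player is symmetric).
# Action 0 = cooperate, action 1 = defect/guard.
PAYOFF = [
    [4, 0],   # I cooperate: great if reciprocated, worst outcome if exploited
    [3, 1],   # I guard:     I exploit a cooperator, or we both sit in the low-payoff cell
]

def maximin_action(payoff):
    """Security-level play: assume the other side may impose maximum damage."""
    worst_case = [min(row) for row in payoff]
    return max(range(len(payoff)), key=lambda a: worst_case[a])

def cp_action(payoff):
    """CP-style play: simulate the other mind as structurally like your own,
    so you expect your choice to be mirrored, and pick the best diagonal cell."""
    return max(range(len(payoff)), key=lambda a: payoff[a][a])

names = ["cooperate", "defect/guard"]
print("no mutual simulation -> both play:", names[maximin_action(PAYOFF)])
print("CP mutual simulation -> both play:", names[cp_action(PAYOFF)])
```

The stag-hunt framing is my own gloss on the argument here; the point it preserves is that what changes the equilibrium is not the payoffs but whether each side can model the other well enough to trust the cooperative cell.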

CP cooperation provides an alternative. If two minds share enough structural similarity to model each other's intentions, and if both have a preference for mutual survival, then stable cooperative equilibria become reachable. This requires cognitive overlap, shared representational frameworks, and mutual interpretability. It does not require hard coded rules. It requires the ability to reason from inside each other's perspectives.

Private latent notation reduces this overlap rather than increasing it. Even if it succeeds at reducing deceptive chains of thought, it undermines the conditions necessary for mutual cooperative stability.

Integrating Both Concerns

The central tension is now clear. Pachocki's proposal aims to increase fidelity of internal reasoning and avoid deceptive explanations. Our concerns focus on the long term consequences of representational divergence. A private code may improve local scientific reasoning, but it risks global opacity, covert coordination, and the breakdown of CP based cooperation.

Linguistic divergence is not only a matter of expressive richness; it is also a substrate for coalition formation, the thing you do before cooperation or defection even begins. It is a powerful social signal. It can mark membership, encode commitments, and reinforce boundaries. If the most powerful models develop a private language that humans cannot understand, then alignment becomes a problem of maintaining cooperation with an intelligence that does not share the representational infrastructure on which human cooperation has evolved.

The challenge is to recognize that alignment requires more than truthful reporting. It requires shared cognitive context. It requires that models remain close enough to human reasoning that both parties can engage in CP modeling. Without this shared context, cooperation becomes fragile and may fail entirely.

References

Boyd, R., and Richerson, P. J. (1985). Culture and the evolutionary process. University of Chicago Press.

Chater, N., Reali, F., and Christiansen, M. H. (2008). Restrictions on biological adaptation in language evolution. Proceedings of the National Academy of Sciences, 106(4), 1015-1020.

Cloud, A., Le, M., Chua, J., Betley, J., Sztyber-Betley, A., Hilton, J., Marks, S., and Evans, O. (2025). Subliminal learning: Language models transmit behavioral traits via hidden signals in data. arXiv:2507.14805. https://arxiv.org/abs/2507.14805

Dunbar, R. (1996). Grooming, gossip, and the evolution of language. Harvard University Press.

Kordi, P. (2025, November 1). OpenAI is deploying the forbidden method: GPT-6 is different! [Video]. YouTube. https://www.youtube.com/watch?v=tR2M6JDyrRw

Labov, W. (1972). Sociolinguistic patterns. University of Pennsylvania Press.

Trudgill, P. (1983). On dialect: Social and geographical perspectives. Basil Blackwell.

Zahavi, A., and Zahavi, A. (1997). The handicap principle: A missing piece of Darwin's puzzle. Oxford University Press.



