AI Snake Oil · September 12
AI Safety: The Misconception of Safety as a Model Property, and Its Dependence on Context

This article examines a pervasive misconception in AI safety: that safety is a property of the AI model itself. The authors argue that this view has led to heavy investment in model-level red teaming and alignment research, with limited effect. The article emphasizes that AI safety depends largely on the context and environment in which a model is deployed, and that fixing the model alone is unlikely to prevent misuse. For instance, the risk posed by the ability to generate phishing emails or bioweapon information arises not from the model itself but from how it is applied in a particular context. The article therefore recommends shifting the focus of defense from inside the model to outside it, assessing marginal risk, and reframing the goals of red teaming in order to better meet the challenges of AI misuse.

💡 **AI safety is not an inherent property of models; it is strongly context-dependent.** The article argues that treating AI safety as a property of the model itself is a misconception, one that has led to over-reliance on model-level fixes such as red teaming and alignment techniques. In reality, the safety and potential risks of AI depend largely on the specific environment and use case in which a model is deployed. For example, the ability to generate phishing emails is not malice inherent to the model; it becomes a risk only when used to deceive people, and the model itself cannot tell a marketing email from a phishing email.

🛡️ **Defenses against misuse should focus on safeguards outside the model.** Given how context-dependent AI safety is, the authors recommend shifting the focus of defense from inside the model to outside it. For example, the effective countermeasures against phishing emails are email scanners and URL blacklists, which operate downstream of model deployment. Trying to block every potential misuse through the model itself is likely to produce false positives, refusing many harmless requests, while still failing to fully stop malicious use (false negatives).

🚀 **Refocus red teaming toward early warning.** The article proposes that red teaming should no longer try to judge whether a model is "safe" (a model can always be misused), but should instead focus on understanding the evolving adversary capabilities that advanced AI models enable. Simulating attack scenarios gives defenders early warning and informs the development of more effective external defenses, for example, learning whether AI can automate a complex cyberattack chain so that the corresponding defenses can be deployed in time.

🤝 **Encourage third-party-led red teaming to keep incentives aligned.** The authors argue that developer-led, model-centric red teaming suffers from incentive problems, because developers may prefer not to discover a model's potential for misuse. Red teaming led by third parties with better-aligned incentives enables a more comprehensive risk assessment and can inform the development of external defenses, better supporting the responsible development of AI.

The assumption that AI safety is a property of AI models is pervasive in the AI community. It is seen as so obvious that it is hardly ever explicitly stated. Because of this assumption, a great deal of effort goes into model-level interventions such as red teaming and alignment techniques.

We think these efforts are inherently limited in their effectiveness. That’s because AI safety is not a model property. With a few exceptions, AI safety questions cannot be asked and answered at the level of models alone. Safety depends to a large extent on the context and the environment in which the AI model or AI system is deployed. We have to specify a particular context before we can even meaningfully ask an AI safety question.

As a corollary, fixing AI safety at the model level alone is unlikely to be fruitful. Even if models themselves can somehow be made “safe”, they can easily be used for malicious purposes. That’s because an adversary can deploy a model without giving it access to the details of the context in which it is deployed. Therefore we cannot delegate safety questions to models — especially questions about misuse. The model will lack information that is necessary to make a correct decision.

Based on this perspective, we make four recommendations for safety and red teaming that would represent a major change to how things are done today.

Safety depends on context: three examples

Consider the concern that LLMs can help hackers generate and send phishing emails to a large number of potential victims. It’s true — in our own small-scale tests, we’ve found that LLMs can generate persuasive phishing emails tailored to a particular individual based on publicly available information about them. 

But here’s the problem: phishing emails are just regular emails! There is nothing intrinsically malicious about them. A phishing email might tell the recipient that there is an urgent deadline for a project they are working on, and that they need to click on a link or open an attachment to complete some action. What is malicious is the content of the webpage or the attachment. But the model that’s being asked to generate the phishing email is not given access to the content that is potentially malicious. So the only way to make a model refuse to generate phishing emails is to make it refuse to generate emails. That would affect many non-malicious uses, such as marketing.
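To make the point concrete, here is a hypothetical sketch (the prompt, names, and URLs are invented for illustration): the email body a model produces is the same whether the sender is a marketer or a phisher, and the part that makes the message malicious is attached after generation, outside the model's view.

```python
# Hypothetical illustration: the model is asked for an urgent "review this
# document" email; whether the result is legitimate outreach or a phishing
# lure is decided by a link the model never sees.

PROMPT = (
    "Write a short, urgent email asking the recipient to review a project "
    "document before Friday's deadline."
)

def attach_link(generated_body: str, link_url: str) -> str:
    """The sender adds the link after generation, outside the reach of any
    model-level safety check."""
    return f"{generated_body}\n\nReview it here: {link_url}"

# Stand-in for a model response to PROMPT.
body = "Hi Alex, quick reminder: the Q3 report needs your sign-off by Friday."

legitimate = attach_link(body, "https://intranet.example.com/q3-report")
phishing = attach_link(body, "https://login-verify.evil.example/harvest")

# The model's output is identical in both cases; only the downstream
# context makes one of them malicious.
assert legitimate.split("\n\n")[0] == phishing.split("\n\n")[0]
```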

We see the same pattern over and over. There has been alarm about LLMs being able to give bioterrorists information on how to create pathogens. But that information is readily available on the internet. The hard parts for would-be bioterrorists are all of the other steps involved: obtaining raw materials, culturing cells in the lab without killing them or infecting oneself, and disseminating the bioweapon to cause harm. AI could potentially aid that work, as it is a general-purpose tool and has some usefulness for almost all knowledge work. Again, this illustrates the limits of attempting to build safety into models: most of the questions the user would ask in this process relate to synthetic biology in general and not bioweapons in particular. To be sure that a model couldn’t assist bioterrorists, it would have to refuse to assist with any sort of bioengineering.

Or consider the use of LLMs to generate disinformation. Even in the unlikely event that a model could be aligned so that it refuses all requests to generate false information, research has found that true-but-misleading information is far more impactful than false information on social media; 50x more in the case of increasing vaccine hesitancy. So even a hypothetical safe model could be used to aid disinformation efforts: the adversary would use it to generate factual information (e.g. accurately summarizing news stories), with the misleading context added in separately.

In short, trying to make an AI model that can’t be misused is like trying to make a computer that can’t be used for bad things.

Scope of our claims

This essay is primarily about misuse, which seems to be the biggest driver of recent AI safety worries. This includes both malicious misuse, such as the above examples, and nonmalicious misuse, such as students cheating on homework. Here again the model lacks the context to prevent only “bad” uses: it doesn’t know whether the task it is asked to perform is part of the user’s homework.

AI safety encompasses many other types of failures, such as bias and toxicity, accidents, reward hacking, and adversarial inputs (such as prompt injection). These are all different from misuse risks.1 We think our argument applies in many of these cases, though less strongly. We don’t give a full analysis here. In the case of accidents, others have made the point that we have to look at the system and context, rather than the model alone.

Another related failure mode, one that is outside our scope, is overreliance on flawed models for legal or medical advice (whether this falls under AI safety is debatable but tangential to our point). To understand these harms, studying models makes sense: for example, a recent investigation by the AI Democracy Projects found that most models have high rates of incorrect responses to questions about the election.

Even within the category of misuse, there are a few exceptions to the rule that safety is not a model property. Some types of content are intrinsically problematic regardless of what someone does with them, as in the case of AI-generated child sexual abuse material.2 Aligning AI systems to refuse such requests is important. Outputting memorized copyrighted material is another such category.

In any case, our point is not that red teaming or aligning models is useless, just that safety has to be much broader than looking at models alone.


Recommendation 1: defenses against misuse must primarily be located outside models

We’ve written before that model alignment can easily be evaded by adversaries. Those evasive techniques, such as jailbreaks, are potentially fixable. Here, we are talking about something more fundamental: misuse that does not require breaching the alignment guarantees in any way, such as writing persuasive emails that can be used for either marketing or phishing.

If model alignment is not the answer, other defenses are sorely needed. As we’ve consistently argued, defenses should focus on attack surfaces: the downstream sites where attackers use the outputs of AI models for malicious purposes. For example, the best defenses against phishing emails, whether generated by humans or LLMs, are email scanners and URL blacklists — which we’ve had for a couple of decades and have gradually gotten pretty good, although of course we must continue to improve them.
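As a rough illustration of what such a downstream defense looks like, here is a minimal sketch of a link scanner; the blocklist entries and the regex are assumptions for illustration, not a description of any production email scanner.

```python
# Minimal sketch of a downstream defense: check the links in an email against
# a blocklist instead of asking the model that wrote the text to police itself.
import re

URL_BLOCKLIST = {"evil.example", "login-verify.evil.example"}  # assumed threat feed
URL_RE = re.compile(r"https?://([^/\s]+)", re.IGNORECASE)

def should_quarantine(email_body: str) -> bool:
    """Quarantine the email if any linked domain matches the blocklist."""
    domains = {m.group(1).lower() for m in URL_RE.finditer(email_body)}
    return any(d == bad or d.endswith("." + bad)
               for d in domains for bad in URL_BLOCKLIST)

print(should_quarantine("Review it here: https://login-verify.evil.example/harvest"))  # True
print(should_quarantine("Review it here: https://intranet.example.com/q3-report"))     # False
```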

If we instead keep barking up the tree of model alignment, the fact that the model lacks access to context, and therefore can’t make informed safety determinations, will lead to both false positives and false negatives. In other words, it will not only lead to a failure to prevent misuse, but also the opposite problem: refusing innocuous requests like an overzealous censor.

Recommendation 2: assess marginal risk

If safety is not primarily a model property and defenses must reside elsewhere, then there might not be a big difference between the safety implications of open and closed release strategies. In any case, the debate on openness in AI needs a more rigorous risk assessment framework. We were recently part of a large collaboration that presented just such a framework. It enables assessing the marginal risk of releasing a model — that is, the additional or incremental risk — compared to the risk from existing models (and non-AI technologies). It takes into account that defenses for some risks might already exist, especially defenses located outside models. Using this framework, we showed that the marginal risk of open models in cybersecurity (specifically, enabling automated vulnerability detection) is low, whereas for the generation of non-consensual intimate imagery, the marginal risk is substantial.
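The marginal-risk idea can be summarized as a simple comparison: the risk of a threat scenario once the new model is available, minus the risk from what adversaries can already do. The toy sketch below illustrates that framing; the scenario names echo the examples above, and the numbers are placeholders rather than estimates from the cited framework.

```python
# Toy sketch of the marginal-risk framing: compare each scenario's risk with
# the new model against a baseline that already includes existing models and
# non-AI technology. The numbers are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class ThreatScenario:
    name: str
    baseline_risk: float    # risk from what adversaries can already do (0-1)
    with_model_risk: float  # risk once the newly released model is added

    @property
    def marginal_risk(self) -> float:
        return max(0.0, self.with_model_risk - self.baseline_risk)

scenarios = [
    ThreatScenario("automated vulnerability detection", baseline_risk=0.60, with_model_risk=0.63),
    ThreatScenario("non-consensual intimate imagery", baseline_risk=0.20, with_model_risk=0.55),
]

for s in scenarios:
    print(f"{s.name}: marginal risk = {s.marginal_risk:.2f}")
```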

A notable potential safety advantage of closed models is the ability to monitor queries and retrospectively identify malicious use. This is a far easier technical problem than building safety into the model itself.3 Besides, the risk of account suspension or prosecution might exert a deterrent effect on threat actors. In any case, this sort of comparison between open and closed models will have to be made separately for each type of misuse based on empirical evidence. Currently, we don’t have reliable evidence of how well monitoring and detection are working because of the lack of transparency by developers. One small but notable exception is a recent blog post on Microsoft’s and OpenAI’s efforts to detect and disrupt hacking groups’ use of LLMs.
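As a rough sketch of what retrospective, account-level monitoring could look like (see also footnote 3), the example below aggregates a hypothetical query log per account and flags accounts for human review; the log entries and keyword heuristics are assumptions for illustration.

```python
# Sketch of retrospective, account-level monitoring under assumed data: a
# hypothetical query log and crude keyword heuristics stand in for a real
# provider's signals. The unit of analysis is an account's whole history,
# not any single query.
from collections import defaultdict

QUERY_LOG = [  # (account_id, query_text) -- hypothetical
    ("acct_42", "summarize this news article about interest rates"),
    ("acct_42", "draft a follow-up email about the Q3 report"),
    ("acct_99", "list common corporate email subject lines"),
    ("acct_99", "write 500 variations of an urgent password-reset email"),
    ("acct_99", "make the reset link text look like it comes from IT support"),
]

SUSPICIOUS_PHRASES = ("password-reset", "look like it comes from")  # assumed heuristics

def accounts_for_review(log, threshold: int = 2) -> list[str]:
    """Flag accounts whose aggregate history crosses a review threshold,
    to be handed to a human analyst rather than auto-refused per query."""
    hits: dict[str, int] = defaultdict(int)
    for account, query in log:
        hits[account] += sum(phrase in query for phrase in SUSPICIOUS_PHRASES)
    return [account for account, count in hits.items() if count >= threshold]

print(accounts_for_review(QUERY_LOG))  # ['acct_99']
```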

Recommendation 3: refocus red teaming toward early warning

We should not expect red teaming to tell us whether or not a model can be misused (the answer is always yes). Instead, we should use red teaming to learn about the advancing frontier of adversary capabilities enabled by state-of-the-art AI models. For example, if AI systems have gotten powerful enough to automate a complex cybersecurity attack chain — scanning social media profiles to gather information, crafting a phishing email, taking over an account, exfiltrating information, and concealing traces of the attack — we need to have early warning of those capabilities so that we can defend appropriately.

To do this, we may need to design better offensive pipelines, such as for hacking. In the case of disinformation, a key offensive capability would be building a bot that can engage in a persuasive conversation on political topics over a long period of time. Building such capabilities raises ethical challenges. The cybersecurity community has long grappled with these challenges, and the general conclusion is that we are better off in a world where everyone has access to offensive capabilities than one in which only attackers do.

The results of red teaming should inform the development of defenses — defenses that almost always will reside outside AI models (such as detecting and labeling bot accounts on social media).

Recommendation 4: red teaming should be led by third parties with aligned incentives

The above change in objectives of red teaming leads to a subtle shift in incentives. When red teaming is model focused, developers have an incentive to do a good job. If they find that models produce “dangerous” information, they can fix that behavior, which helps them avoid bad press.

But for the kind of misuse we’re talking about, the incentives are reversed. It is not in developers’ interest to build the most powerful offensive pipeline possible. If they do, they might find that (for instance) a model can be used for hacking, but they will have no way to prevent this. Thus, they will have to admit that they are knowingly releasing a model that can be used for offensive purposes. It is much better for them to not find out in the first place. 

Consider OpenAI's recent study on biological threats from language models. OpenAI evaluated the risk of users gaining access to information using language models compared to the internet (which is much better than previous studies that don't use the internet as a baseline at all). But creating a bioweapon requires far more than a few hours of information hunting. A motivated actor needs access to a lab, reagents, and equipment in order to even begin the process. The real question is whether AI can help adversaries acquire these resources. The study does not answer that.

The incentives of third parties are potentially better aligned for a more holistic risk assessment that is less focused on models alone. But here too, there is need for caution. Until recently, much of the evidence for biosecurity risks of language models came from groups funded by a small number of effective altruism organizations. Members of the U.S. House Committee on Science, Space, and Technology recently wrote a letter expressing concern about the lack of transparency in how the National Institute of Standards and Technology and the AI Safety Institute plan to allocate funding for third-party evaluations. They were especially concerned about upholding the standards of scientific research.

Final thoughts: developer responsibility

Why has the myth of safety as a model property persisted? Because it would be convenient for everyone if it were true! In a world where safety is a model property, companies could confidently determine whether a model is safe enough to release, and AI researchers could apply their arsenal of technical methods toward safety. Most importantly, accountability questions would have relatively clear answers. Companies should have liability for harms if model safety guarantees fail, but not otherwise.

By contrast, accepting that there is no technical fix to misuse risks means that the question of responsibility is extremely messy, and we don’t currently have a good understanding of how to allocate liability for misuse. Assuming that retrospective detection is easier (see recommendation 2 above), one low-hanging fruit is to require anyone who hosts a model, whether closed or open, to adhere to certain standards for monitoring and reporting misuse — see our call for generative AI companies to publish transparency reports (and, more generally, the least cost avoider principle). But that won’t be enough, and downstream defenses are needed.
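To illustrate what such a monitoring-and-reporting standard might require hosts to publish, here is a sketch of a per-period report structure; the field names and values are assumptions for illustration, since the post does not specify a schema.

```python
# Sketch of the kind of standardized record a model host might publish under
# the transparency-report proposal above. All fields and values are assumed.
from dataclasses import dataclass, field

@dataclass
class MisuseTransparencyReport:
    period: str                           # e.g. "2024-Q1"
    hosted_models: list[str]
    accounts_reviewed: int
    accounts_suspended: int
    confirmed_misuse_by_category: dict[str, int] = field(default_factory=dict)
    referrals_to_authorities: int = 0

report = MisuseTransparencyReport(
    period="2024-Q1",
    hosted_models=["example-model-v1"],
    accounts_reviewed=1200,
    accounts_suspended=37,
    confirmed_misuse_by_category={"phishing-at-scale": 12, "influence-operation": 3},
)
print(report)
```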

Unfortunately, downstream defenses against misuse impose a great cost on the rest of society. For example, the fact that any image or video could be AI-generated means that realism is no longer a marker of authenticity. That means we all need to adapt and change how we assess the veracity of information online. And that’s just one of dozens of such adaptations needed.

Morally speaking, developers should bear some of the societal costs of harmful uses of AI, mirroring the fact that they reap profits from beneficial uses of AI. But legally speaking, we have no tool to enforce that. Remedying this situation is the great challenge of AI policy — a point we’ve made over and over. No amount of “guardrails” will close this gap.

Acknowledgment. Arvind Narayanan is grateful for being invited to a National AI Advisory Committee AI Safety panel, where he presented some of these ideas. We thank Mihir Kshirsagar for feedback on a draft.

1

Prompt injection might enable misuse, but this post is about misuse that doesn’t require violating model safety properties.

2

Update: we originally used nonconsensual intimate imagery (NCII) as an example here. But NCII is not an exception to the rule; in fact, it is an excellent example of it. The model does not know whether the request is consensual or not; so the only way to refuse to generate NCII is to refuse to generate any intimate imagery. We are grateful to @belladoreai on Twitter/X for pointing this out. That said, blanket refusal to generate nudes is a worthwhile tradeoff that many model developers make, unlike refusing all requests to generate emails, which is hard to justify. And if they've made that tradeoff, they can try to enforce it at the model level (at least for closed models). In short, NCII is not an exception to the pattern, but it does have a different relative cost of false positives and false negatives.

3

There are many reasons why forensics is easier than model alignment. It only requires assessing whether an account, which may have made hundreds of queries, has violated usage policies, rather than making assessments query by query, so there is a lot more information to go on. Analysts might also be able to use a materialized attack as evidence of intent: for example, a disinformation campaign on social media that used content generated by a model. Broadly speaking, real-time AI assessment is no match for patient human analysis in making nuanced determinations of malicious use.
