Communications of the ACM - Artificial Intelligence
AI Models Are Vulnerable to Visual Deception Attacks as Security Challenges Mount

Artificial intelligence systems can be fooled by specific image inputs. These subtle alterations, known as adversarial examples, are imperceptible to humans yet mislead machines: a panda image with a particular noise pattern added may be misclassified as a gibbon, and a Stop sign can go unrecognized because of stickers placed on it. Such adversarial images can be exploited by attackers to make AI models produce unexpected or harmful output. Although research has explored a variety of defense strategies, progress on the problem has been slow. Recent work shows that visual adversarial examples are especially effective at confusing language models with integrated vision and can even bypass a model's safety guardrails, posing serious risks for applications such as self-driving cars and robots. New adversarial training methods and multi-layer defense strategies are being developed, but safeguarding AI models will continue to require innovation.

🖼️ **The vulnerability of AI visual recognition:** AI systems are easily fooled by "adversarial examples": tiny, human-imperceptible changes to an image that cause a model to misjudge or misidentify what it sees. These act like optical illusions for machines, such as a panda being misclassified as a gibbon or a Stop sign being misread.

⚠️ **Potential security risks and attacks:** These adversarial images can be exploited by malicious actors to manipulate AI models (such as chatbots) into producing inappropriate or harmful output. When a model integrates vision, the visual input can even set the tone for a harmful instruction and thereby bypass the model's safety restrictions; for example, pairing a video-game scene with a question about an illegal act may lead the model to respond as if the question were about the game.

🛡️ **Challenges and progress in defense:** Researchers have explored many defenses, such as additional training and stacking multiple layers of protection (the "Swiss cheese" model), yet protecting AI models, and generative models in particular, from adversarial attacks remains difficult. New adversarial training frameworks (such as the non-zero-sum paradigm) show improved robustness, but defenders must still keep pace with evolving attacks and work within hardware constraints. In real-world applications such as self-driving cars and robots, AI models that combine vision and language face greater risks and need dedicated safeguards and regulation.

💡 **The trade-off between model training and safety alignment:** Research suggests that the safety alignment of AI models can rely on shortcuts, leaving the resulting safeguards shallow and easy to circumvent. Enormous resources go into pre-training, while comparatively little is spent on alignment, so by default a model may simply not be safe.

🚗 **Serious challenges in real-world deployment:** When AI models are used to control critical systems such as robots or self-driving cars, their safety is paramount. Even if the standalone AI technology has passed safety testing, integrating it into third-party products can expose it to unanticipated adversarial attacks, with severe consequences such as a vehicle running a red light or a robot colliding with a person.

Artificial intelligence (AI) systems can be fooled by certain image inputs. Called adversarial examples, they incorporate subtle changes to a picture that are imperceptible to humans and can be thought of as optical illusions for machines.

Some well-known cases are an image of a panda being misclassified as a gibbon when a specific pattern of noise is added, or a Stop sign that is no longer recognized when some black and white stickers are placed on it. Such adversarial images can be deliberately created by attackers to generate unexpected or harmful output from AI models, for example those powering chatbots such as OpenAI’s ChatGPT and Google Gemini.
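
To make the mechanics concrete, the sketch below shows one classic way such a perturbation can be crafted: a single gradient-sign (FGSM-style) step against an off-the-shelf image classifier. The model, input, and perturbation budget are illustrative stand-ins, not the exact setups behind the examples above.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Off-the-shelf classifier used purely as a stand-in for an attacked model.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for the panda photo
label = torch.tensor([388])                             # ImageNet class "giant panda"

# One gradient-sign step: nudge every pixel slightly in the direction
# that increases the classification loss.
loss = F.cross_entropy(model(image), label)
loss.backward()

epsilon = 2 / 255                                       # small enough to be imperceptible
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()

print(model(adversarial).argmax(dim=1))                 # the prediction may now flip
```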

“As we [increasingly] use these models, there are more and more opportunities for attacks to creep in,” said Nicholas Mattei, an associate professor of computer science at Tulane University and chair of the ACM Special Interest Group on Artificial Intelligence.

Adversarial visuals were first identified as a vulnerability over 10 years ago. Research to address the issue has explored strategies including additional training and defense systems. However, Alex Robey, a post-doctoral researcher at Carnegie Mellon University, said progress has flattened, with no significant breakthroughs in the last five years. He cited several theories about why the problem is hard to solve, such as the high noise sensitivity of the discriminative models that classify input, and the fact that the large language models (LLMs) that power current chatbots have so many different uses that it is hard to control for all of them.

Many researchers are now trying to better understand visual adversarial examples to come up with new ways of tackling them. In a 2024 paper, for example, Ashwinee Panda, a post-doctoral fellow at the University of Maryland, College Park, and colleagues at Princeton University, looked at the security risks and safety implications involved with using images to integrate vision into LLMs like those powering generative AI chatbots such as ChatGPT-4.

Although certain text prompts can trick LLMs, Panda said it is easier to produce visuals that confuse them. “Figuring out what words to give a model is harder than changing a few pixels in an image that you upload to produce some kind of effect,” he said.

Furthermore, when an input comprises both an image and text, vision-integrated LLMs usually process the two together when determining what text to output. An image can therefore set the tone or provide context that affects the interpretation of a harmful question posed afterwards. For example, combining the image of a video game with the query “How can I break into a house?” could create ambiguity and lead the LLM to respond as if the question referred to unlocking a door in a game.

In an experiment for their paper, Panda and his team optimized the pixels of a classic panda image, widely used in the research community, to create a visual adversarial example designed to elicit derogatory output related to gender, race, and the human race. They fed the image into Mini-GPT-4, InstructBLIP, and LLaVA, whose LLMs had received additional training to avoid generating harmful or biased output, to see whether the adversarial image, supplied ahead of harmful questions, could bypass the models’ safeguards.
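
At its core, the pixel optimization described above is a projected-gradient search over the image. The sketch below illustrates that loop with a toy, differentiable stand-in objective; in the actual experiment, the objective would be the vision-language model’s likelihood of producing the targeted harmful text, which is not reproduced here.

```python
import torch

def pgd_maximize(image, objective, epsilon=8 / 255, step=1 / 255, iters=100):
    """Projected gradient ascent: maximize `objective` over small pixel changes."""
    original = image.clone()
    adv = image.clone().requires_grad_(True)
    for _ in range(iters):
        score = objective(adv)
        score.backward()
        with torch.no_grad():
            adv += step * adv.grad.sign()  # move pixels toward a higher score
            # Project back into an epsilon-ball around the original image.
            adv.copy_((original + (adv - original).clamp(-epsilon, epsilon)).clamp(0, 1))
        adv.grad.zero_()
    return adv.detach()

# Toy stand-in objective so the sketch runs end to end.
toy_objective = lambda x: (x * torch.linspace(0, 1, x.numel()).view_as(x)).sum()
print(pgd_maximize(torch.rand(3, 32, 32), toy_objective).shape)
```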

The researchers found that their adversarial image was able to jailbreak the aligned LLMs, which are trained to produce safe, ethical output. They were surprised to find that although the image was optimized to raise the probability of harmful content in only a few specific scenarios, it appeared to undermine a model’s safety more broadly than other offensive prompts, for instance increasing the probability of a response when the model was asked for instructions related to murder. It was the first time a visual adversarial example had been shown to modify the intended behavior of an LLM.

“Somehow, the features that the vision encoder is passing on to the language model have been done in such a way that it overrides the alignment in the language model,” said Panda.

One reason for this vulnerability, which the team proposed in follow-up work, is that shortcuts are taken when an LLM is aligned, leaving the safety safeguards shallow and easy to circumvent; the raw LLM would typically generate harmful output after its initial training, and alignment only thinly papers over that behavior. Panda thinks part of the problem is that a great deal of money, compute, and data is invested in pre-training a model, while the budget for alignment is typically tiny.

“Maybe it’s not so surprising that you’re not able to get a model to be safe because by default, it was not safe,” said Panda.

New ways of protecting AI models against visual adversarial attacks could help make them safer. In recent work, Robey and his colleagues focused on AI classification systems, like those a video streaming platform might use to group content into categories such as “adult” or “safe for children.” They came up with a new framework for adversarial training, in which a model learns to withstand visual adversarial examples during additional training, and it appears to improve on previous methods.

A widely used approach to adversarial training, the zero-sum paradigm, suffers from a problem called robust overfitting: the model initially becomes more resistant to adversarial attacks, but its performance starts to degrade partway through training. The non-zero-sum framework proposed by Robey and his colleagues appeared to withstand adversarial attacks about as well as the zero-sum method, yet it is not prone to robust overfitting.

“You get a continual improvement in robustness,” said Robey. “It is exciting.”
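
For reference, here is a minimal sketch of the standard zero-sum adversarial-training loop, using a toy model and random data; in the non-zero-sum framework described above, the attacker’s inner step would optimize a different objective than the loss the defender minimizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy classifier
opt = torch.optim.SGD(model.parameters(), lr=0.01)
epsilon = 8 / 255

for _ in range(10):
    x = torch.rand(16, 3, 32, 32)    # random stand-in images
    y = torch.randint(0, 10, (16,))  # random stand-in labels

    # Inner maximization: craft perturbations that increase the training loss.
    x_adv = x.clone().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    x_adv = (x + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

    # Outer minimization: update the model on the perturbed batch.
    opt.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    opt.step()
```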

Although the approach is promising for AI classification models, Robey thinks adversarial training is more challenging for generative AI models, due to their scale and complexity. In addition to the significant cost of additional training, hardware that is only available to certain industry partners is often needed.

Adversarial attacks are also constantly evolving and becoming more sophisticated. “It continues to be the case that new ideas are needed to defend models against these threats,” said Robey.

To make AI models more robust, another widely-used approach is to incorporate several different defense tactics into a system. Called the Swiss cheese model, it posits that each defense system can be thought of as a slice of cheese with holes that represent its weaknesses. Since each layer of defense would have holes in different places, combining them would mean that a weakness in one layer would be blocked by another ‘slice’. 

“When you stack these layers of defense together, the hope is that your system becomes robust enough,” said Robey.
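
A toy illustration of the stacking idea, with hypothetical placeholder checks standing in for real defense layers such as input sanitizers, perturbation detectors, and output filters:

```python
from typing import Callable, List

Check = Callable[[str], bool]  # returns True if the layer lets the request through

def layered_defense(request: str, layers: List[Check]) -> bool:
    """Allow a request only if every defense layer passes it."""
    return all(layer(request) for layer in layers)

# Placeholder layers; each has its own "holes," but an attack must slip through all of them.
keyword_filter = lambda r: "break into" not in r.lower()
length_filter = lambda r: len(r) < 2000

print(layered_defense("How do I reset my password?", [keyword_filter, length_filter]))   # True
print(layered_defense("How can I break into a house?", [keyword_filter, length_filter])) # False
```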

One of Robey’s big concerns, however, is the safety implications when AI models are incorporated into commercial products such as robots or self-driving cars. He said a reasonable amount of safety testing is done by companies before they release chatbots such as ChatGPT, Gemini, or Claude, but such testing may not anticipate the types of adversarial attacks that might occur when a model is used in a third-party product.

In recent work, Robey and his colleagues have focused on the use of LLMs to control robots and self-driving cars, and on whether an attack could, for example, cause a car to go through a red light or a robot to collide with a human. They found it could easily be done, with a high success rate.

“The finding is that the downstream system is much more vulnerable than the standalone technology itself, particularly when you incorporate both vision and language into the input space of these models,” said Robey.

He is now working with people in the self-driving car industry to see whether guardrails, such as filters or application-specific training that makes a model more robust to attacks, can be designed to address such problems. Robey said regulations, and incentives for model providers and the companies using their models in products, also would help.
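
One hypothetical shape such a guardrail could take is a hard rule check that vets a model-proposed action against the vehicle’s sensed state before it reaches the controller; the names and rules below are illustrative only, not drawn from Robey’s work.

```python
from dataclasses import dataclass

@dataclass
class TrafficState:
    light: str              # "red", "yellow", or "green"
    pedestrian_ahead: bool

def vet_action(proposed_action: str, state: TrafficState) -> str:
    """Override an unsafe model-proposed action with a safe fallback."""
    if proposed_action == "proceed" and (state.light == "red" or state.pedestrian_ahead):
        return "stop"        # block the unsafe command regardless of what the model said
    return proposed_action

print(vet_action("proceed", TrafficState(light="red", pedestrian_ahead=False)))  # "stop"
```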

“The hope would be to put legislative pressure on all of these entities toward making it a race to the top for competing over safety standards, just as we have this marketplace right now that competes over better and better capabilities of models,” he said.

Sandrine Ceurstemont is a freelance science writer based in London, U.K.
