A Microsaccade-Inspired Method for Detecting LLM Misbehaviour

cs.AI updates on arXiv.org, October 3

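Inspired by microsaccades, this paper proposes a method for detecting latent misbehaviour in large language models (LLMs). The method applies lightweight perturbations to the position encoding to elicit internal signals from the model and, without fine-tuning or task-specific supervision, detects model failures across settings including factuality, safety, toxicity, and backdoor attacks. Experiments show that it is effective and computationally efficient on multiple state-of-the-art LLMs.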

arXiv:2510.01288v1 Announce Type: cross

Abstract: We draw inspiration from microsaccades, tiny involuntary eye movements that reveal hidden dynamics of human perception, to propose an analogous probing method for large language models (LLMs). Just as microsaccades expose subtle but informative shifts in vision, we show that lightweight position encoding perturbations elicit latent signals that indicate model misbehaviour. Our method requires no fine-tuning or task-specific supervision, yet detects failures across diverse settings including factuality, safety, toxicity, and backdoor attacks. Experiments on multiple state-of-the-art LLMs demonstrate that these perturbation-based probes surface misbehaviours while remaining computationally efficient. These findings suggest that pretrained LLMs already encode the internal evidence needed to flag their own failures, and that microsaccade-inspired interventions provide a pathway for detecting and mitigating undesirable behaviours.
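
The abstract does not spell out how the position encoding is perturbed or how the elicited signal is read. As a rough illustration of the general idea only, the sketch below jitters the positions fed to a toy single-head attention layer and measures how far each token's output moves. Everything here is an assumption made for illustration and not the paper's procedure: the sinusoidal encoding, the uniform jitter, the L2-norm readout, and the names `sinusoidal_encoding`, `toy_attention`, and `perturbation_shift`.

```python
import numpy as np

def sinusoidal_encoding(positions, dim):
    """Sinusoidal position encoding; accepts fractional positions. dim must be even."""
    positions = np.asarray(positions, dtype=np.float64)[:, None]   # (T, 1)
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))        # (dim/2,)
    angles = positions * freqs[None, :]                            # (T, dim/2)
    enc = np.zeros((positions.shape[0], dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def toy_attention(tokens, positions, w_q, w_k, w_v):
    """Single-head self-attention over token embeddings plus position encoding."""
    x = tokens + sinusoidal_encoding(positions, tokens.shape[1])
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

def perturbation_shift(tokens, w_q, w_k, w_v, epsilon, rng):
    """Per-token L2 shift of the layer output under a small position jitter.

    Hypothetical readout: a larger shift means the representation is more
    sensitive to the perturbation. How such a signal would map to
    misbehaviour is not specified here.
    """
    base_pos = np.arange(tokens.shape[0], dtype=np.float64)
    jitter = rng.uniform(-epsilon, epsilon, size=tokens.shape[0])
    base = toy_attention(tokens, base_pos, w_q, w_k, w_v)
    perturbed = toy_attention(tokens, base_pos + jitter, w_q, w_k, w_v)
    return np.linalg.norm(perturbed - base, axis=1)

# Usage with random stand-in embeddings and weights (not a real LLM).
rng = np.random.default_rng(0)
dim, seq_len = 16, 6
tokens = rng.normal(size=(seq_len, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
shift = perturbation_shift(tokens, w_q, w_k, w_v, epsilon=0.1, rng=rng)
print(shift.shape)
```

In a real model the same pattern would presumably be applied to the model's own position encoding and hidden states, with the resulting shifts used as features for a detector; that step is left out because the source gives no detail on it.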
