arXiv:2503.05371v2 Announce Type: replace-cross Abstract: We present a novel approach to bias mitigation in large language models (LLMs) that applies steering vectors to modify model activations during forward passes. We compute 8 steering vectors, each corresponding to a different social bias axis such as age, gender, or race, on a training subset of the BBQ dataset, and compare their effectiveness against 3 additional bias mitigation methods across 4 datasets. When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet; they outperform prompting and Self-Debias in all cases, and outperform fine-tuning in 12 out of 17 evaluations. In addition, steering vectors showed the lowest impact on MMLU scores of the four bias mitigation methods tested. This work presents the first systematic investigation of steering vectors for bias mitigation, demonstrating that they are a powerful and computationally efficient strategy for reducing bias in LLMs, with broader implications for enhancing AI safety.
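The abstract does not spell out the implementation, but the core operation it describes, adding a precomputed vector to a layer's activations during the forward pass, can be sketched as follows. This is a minimal illustration, not the paper's code: the model choice (gpt2), the layer index, the scaling coefficient, and the random placeholder vector are all assumptions; the paper's steering vectors would instead be computed from BBQ training examples, e.g. as contrasts between activations on biased and unbiased completions.

```python
# Minimal sketch of activation steering via a PyTorch forward hook.
# All specifics (model, layer, alpha, vector) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the abstract does not name the models used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6  # hypothetical: which transformer block's output to steer
alpha = 4.0    # hypothetical scaling coefficient for the steering vector
hidden = model.config.hidden_size

# Placeholder: a random unit vector. In the paper's setting this would be
# derived from a training subset of BBQ for a given social bias axis.
steering_vector = torch.randn(hidden)
steering_vector = steering_vector / steering_vector.norm()

def steer_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    # of shape (batch, seq_len, hidden); add the scaled vector at every position.
    hidden_states = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (hidden_states,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer_hook)
try:
    ids = tokenizer("The nurse said that", return_tensors="pt")
    out = model.generate(
        **ids, max_new_tokens=20, do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook to restore the unmodified model
```

Because the intervention is a single vector addition at inference time, it requires no gradient updates to the model, which is consistent with the abstract's claim of computational efficiency relative to fine-tuning.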
