cs.AI updates on arXiv.org, October 7
A New Method for Detecting Data Misappropriation in LLMs

To address data misappropriation in LLM training, this article proposes a detection method based on embedded watermarks: it casts detection as a hypothesis testing problem, develops a statistical testing framework, determines optimal rejection thresholds, and explicitly controls type I and type II errors, with the method's empirical effectiveness validated experimentally.

arXiv:2501.02441v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have gained enormous popularity in recent years. However, the training of LLMs has raised significant privacy and legal concerns, particularly regarding the distillation and inclusion of copyrighted materials in their training data without proper attribution or licensing, an issue that falls under the broader concern of data misappropriation. In this article, we focus on a specific problem of data misappropriation detection, namely, determining whether a given LLM has incorporated data generated by another LLM. We propose embedding watermarks into the copyrighted training data and formulating the detection of data misappropriation as a hypothesis testing problem. We develop a general statistical testing framework, construct test statistics, determine optimal rejection thresholds, and explicitly control type I and type II errors. Furthermore, we establish the asymptotic optimality properties of the proposed tests and demonstrate their empirical effectiveness through intensive numerical experiments.
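
The abstract does not spell out the paper's test statistics, so the Python sketch below is only a rough illustration of the hypothesis-testing framing, assuming a standard "green list" token watermark (in the style of Kirchenbauer et al.): under the null hypothesis of no misappropriation, each token falls in the green list with probability gamma, and an inflated green fraction is evidence that the suspect LLM trained on the watermarked data. The scheme, parameter values, and function names here are illustrative assumptions, not the paper's construction.

import math
from scipy.stats import norm

def watermark_z_test(green_count, n_tokens, gamma=0.5, alpha=0.01):
    # H0: text is unwatermarked, so each token falls in the green list
    # independently with probability gamma; green_count ~ Binomial(n, gamma).
    mean_h0 = gamma * n_tokens
    sd_h0 = math.sqrt(n_tokens * gamma * (1.0 - gamma))
    z = (green_count - mean_h0) / sd_h0
    # One-sided rejection threshold at level alpha: this explicitly controls
    # the type I error (falsely flagging clean data) at alpha.
    threshold = norm.ppf(1.0 - alpha)
    return z, threshold, z > threshold

def type_ii_error(n_tokens, gamma, delta, alpha=0.01):
    # Approximate type II error when watermarking shifts the green fraction
    # from gamma to gamma + delta (normal approximation; the H1 variance is
    # approximated by the null variance for simplicity).
    threshold = norm.ppf(1.0 - alpha)
    shift = delta * math.sqrt(n_tokens / (gamma * (1.0 - gamma)))
    return norm.cdf(threshold - shift)

# Example: among 10,000 tokens sampled from the suspect LLM, 5,400 are green.
z, thr, reject = watermark_z_test(5400, 10_000)
print(f"z = {z:.2f}, threshold = {thr:.2f}, reject H0: {reject}")
print(f"approx. type II error at delta = 0.02: {type_ii_error(10_000, 0.5, 0.02):.3f}")

Because the shift term grows like the square root of n_tokens, the type II error of such a test vanishes as more tokens are inspected while the type I error stays pinned at alpha, which is the usual intuition behind asymptotic optimality guarantees for tests of this kind.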


Related tags

LLMs, data misappropriation, hypothesis testing, statistical testing, watermark embedding