cs.AI updates on arXiv.org 10月07日
LLM数据剽窃检测:水印嵌入与假设检验
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文针对LLM训练数据中的版权材料剽窃问题,提出通过水印嵌入和假设检验方法进行检测,构建统计测试框架,控制误差,并通过实验验证了方法的有效性。

arXiv:2501.02441v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are rapidly gaining enormous popularity in recent years. However, the training of LLMs has raised significant privacy and legal concerns, particularly regarding the distillation and inclusion of copyrighted materials in their training data without proper attribution or licensing, an issue that falls under the broader concern of data misappropriation. In this article, we focus on a specific problem of data misappropriation detection, namely, to determine whether a given LLM has incorporated the data generated by another LLM. We propose embedding watermarks into the copyrighted training data and formulating the detection of data misappropriation as a hypothesis testing problem. We develop a general statistical testing framework, construct test statistics, determine optimal rejection thresholds, and explicitly control type I and type II errors. Furthermore, we establish the asymptotic optimality properties of the proposed tests, and demonstrate the empirical effectiveness through intensive numerical experiments.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

LLM 数据剽窃 水印检测 假设检验 统计框架
相关文章