cs.AI updates on arXiv.org 10月20日 12:14
LLM克隆检测:模型选择与集成策略
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文针对代码克隆检测问题,探讨了LLM模型选择与集成策略,通过筛选76种LLM并评估其性能,发现模型选择需考虑嵌入尺寸、词汇表大小和定制数据集,同时集成方法如最大值或求和可提升检测效果。

arXiv:2510.15480v1 Announce Type: cross Abstract: Source code clones pose risks ranging from intellectual property violations to unintended vulnerabilities. Effective and efficient scalable clone detection, especially for diverged clones, remains challenging. Large language models (LLMs) have recently been applied to clone detection tasks. However, the rapid emergence of LLMs raises questions about optimal model selection and potential LLM-ensemble efficacy. This paper addresses the first question by identifying 76 LLMs and filtering them down to suitable candidates for large-scale clone detection. The candidates were evaluated on two public industrial datasets, BigCloneBench, and a commercial large-scale dataset. No uniformly 'best-LLM' emerged, though CodeT5+110M, CuBERT and SPTCode were top-performers. Analysis of LLM-candidates suggested that smaller embedding sizes, smaller tokenizer vocabularies and tailored datasets are advantageous. On commercial large-scale dataset a top-performing CodeT5+110M achieved 39.71\% precision: twice the precision of previously used CodeBERT. To address the second question, this paper explores ensembling of the selected LLMs: effort-effective approach to improving effectiveness. Results suggest the importance of score normalization and favoring ensembling methods like maximum or sum over averaging. Also, findings indicate that ensembling approach can be statistically significant and effective on larger datasets: the best-performing ensemble achieved even higher precision of 46.91\% over individual LLM on the commercial large-scale code.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

代码克隆检测 LLM模型 模型选择 集成策略 数据集
相关文章