MULocBench：构建更全面的软件问题定位基准

cs.AI updates on arXiv.org 10月01日 13:59

MULocBench：构建更全面的软件问题定位基准

本文介绍MULocBench，一个包含1100个GitHub Python项目问题的综合数据集，旨在为软件问题定位提供更全面的基准，并评估现有方法的性能。

arXiv:2509.25242v1 Announce Type: cross Abstract: Accurate project localization (e.g., files and functions) for issue resolution is a critical first step in software maintenance. However, existing benchmarks for issue localization, such as SWE-Bench and LocBench, are limited. They focus predominantly on pull-request issues and code locations, ignoring other evidence and non-code files such as commits, comments, configurations, and documentation. To address this gap, we introduce MULocBench, a comprehensive dataset of 1,100 issues from 46 popular GitHub Python projects. Comparing with existing benchmarks, MULocBench offers greater diversity in issue types, root causes, location scopes, and file types, providing a more realistic testbed for evaluation. Using this benchmark, we assess the performance of state-of-the-art localization methods and five LLM-based prompting strategies. Our results reveal significant limitations in current techniques: even at the file level, performance metrics (Acc@5, F1) remain below 40%. This underscores the challenge of generalizing to realistic, multi-faceted issue resolution. To enable future research on project localization for issue resolution, we publicly release MULocBench at https://huggingface.co/datasets/somethingone/MULocBench.

Fish AI Reader

AI辅助创作，多种专业模板，深度分析，高质量内容生成。从观点提取到深度思考，FishAI为您提供全方位的创作支持。新版本引入自定义参数，让您的创作更加个性化和精准。

FishAI

鱼阅，AI 时代的下一个智能信息助手，助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

软件维护问题定位数据集基准测试

相关文章

MS MARCO Web Search: A Large-Scale Information-Rich Web Dataset Featuring Millions of Real Clicked Query-Document Labels

Cross-Device AI Acceleration, Compilation & Execution with Jeff Gehlhaar - #500

This Week In Machine Learning & AI - 5/27/16: The White House on AI & Aggressive Self-Driving Cars

TIGER-Lab Introduces MMLU-Pro Dataset for Comprehensive Benchmarking of Large Language Models’ Capabilities and Performance

CinePile: A Novel Dataset and Benchmark Specifically Designed for Authentic Long-Form Video Understanding

Researchers at the University of Freiburg and Bosch AI Propose HW-GPT-Bench: A Hardware-Aware Language Model Surrogate Benchmark

‘RAG Me Up’: A Generic AI Framework (Server + UIs) that Enables You to Do RAG on Your Own Dataset Easily

HuggingFace Releases ? FineWeb: A New Large-Scale (15-Trillion Tokens, 44TB Disk Space) Dataset for LLM Pretraining

MMLU-Pro: An Enhanced Benchmark Designed to Evaluate Language Understanding Models Across Broader and More Challenging Tasks

benchexec： BenchExec：可靠的基准测试和资源测量框架