LLM Chemistry: A New Framework for Evaluating Multi-Model Collaborative Performance

cs.AI updates on arXiv.org, October 7

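This paper proposes LLM Chemistry, a framework that quantifies whether combinations of LLMs behave synergistically or antagonistically by analyzing their interaction dependencies, and uses that measure to assess their collective performance. The study shows that chemistry among collaborating LLMs is most pronounced under heterogeneous model configurations, with its impact shaped by task type, group size, and complexity.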

arXiv:2510.03930v1 Announce Type: cross

Abstract: Multi-LLM collaboration promises accurate, robust, and context-aware solutions, yet existing approaches rely on implicit selection and output assessment without analyzing whether collaborating models truly complement or conflict. We introduce LLM Chemistry -- a framework that measures when LLM combinations exhibit synergistic or antagonistic behaviors that shape collective performance beyond individual capabilities. We formalize the notion of chemistry among LLMs, propose algorithms that quantify it by analyzing interaction dependencies, and recommend optimal model ensembles accordingly. Our theoretical analysis shows that chemistry among collaborating LLMs is most evident under heterogeneous model profiles, with its outcome impact shaped by task type, group size, and complexity. Evaluation on classification, summarization, and program repair tasks provides initial evidence for these task-dependent effects, thereby reinforcing our theoretical results. This establishes LLM Chemistry as both a diagnostic factor in multi-LLM systems and a foundation for ensemble recommendation.
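
The abstract does not say how chemistry is actually computed, so the following is only an illustrative sketch and not the paper's algorithm. It assumes a deliberately simple proxy: the gap between a group's score and the best score achieved by any single member, positive for synergy and negative for antagonism. The names `synergy_gap` and `rank_ensembles`, the model labels, and all scores are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


def synergy_gap(group_score: float, member_scores: Iterable[float]) -> float:
    """Group score minus the best individual score (an assumed proxy metric)."""
    return group_score - max(member_scores)


def rank_ensembles(
    solo: Dict[str, float],
    group: Dict[FrozenSet[str], float],
) -> List[Tuple[FrozenSet[str], float]]:
    """Rank candidate groups by synergy gap, largest first."""
    ranked = [
        (members, synergy_gap(score, (solo[m] for m in members)))
        for members, score in group.items()
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Made-up scores for illustration only; they are not results from the paper.
solo_scores = {"model_a": 0.70, "model_b": 0.65, "model_c": 0.72}
group_scores = {
    frozenset({"model_a", "model_b"}): 0.78,  # beats both members: synergy
    frozenset({"model_a", "model_c"}): 0.69,  # below the best member: antagonism
    frozenset({"model_b", "model_c"}): 0.72,  # ties the best member: neutral
}

ranking = rank_ensembles(solo_scores, group_scores)
for members, gap in ranking:
    print(sorted(members), round(gap, 2))
```

Under this proxy, a recommender would simply pick the top-ranked group. A faithful implementation would need the paper's formal definition of chemistry and its treatment of task type, group size, and complexity.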

Related tags

LLM Chemistry, multi-model collaboration, performance evaluation, model ensembles, LLM interaction
