Trending
Articles related to "Reliability Assessment"
HIP-LLM: A Hierarchical Imprecise Probability Approach to Reliability Assessment of Large Language Models
cs.AI updates on arXiv.org 2025-11-05T05:24:45.000000Z
Benchmarking Reasoning Reliability in Artificial Intelligence Models for Energy-System Analysis
cs.AI updates on arXiv.org 2025-10-24T04:15:11.000000Z
TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework
cs.AI updates on arXiv.org 2025-10-21T04:27:54.000000Z
Reliability of Large Language Model Generated Clinical Reasoning in Assisted Reproductive Technology: Blinded Comparative Evaluation Study
cs.AI updates on arXiv.org 2025-10-21T04:09:21.000000Z
Pass@k vs Pass^k: Understanding Agent Reliability
philschmid RSS feed 2025-09-30T11:08:48.000000Z
A Comprehensive Survey on the Trustworthiness of Large Language Models in Healthcare
cs.AI updates on arXiv.org 2025-09-18T05:01:08.000000Z
Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case
cs.AI updates on arXiv.org 2025-09-15T08:26:00.000000Z
Continuous Monitoring of Large-Scale Generative AI via Deterministic Knowledge Graph Structures
cs.AI updates on arXiv.org 2025-09-05T04:45:23.000000Z
Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions
cs.AI updates on arXiv.org 2025-08-13T04:15:15.000000Z
Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models
cs.AI updates on arXiv.org 2025-08-12T04:02:03.000000Z
Confidence-Diversity Calibration of AI Judgement Enables Reliable Qualitative Coding
cs.AI updates on arXiv.org 2025-08-05T11:28:57.000000Z
Towards a rigorous evaluation of RAG systems: the challenge of due diligence
cs.AI updates on arXiv.org 2025-07-30T04:12:00.000000Z
ReliabilityBench: Measuring the Unpredictable Performance of Shaped-Up Large Language Models Across Five Key Domains of Human Cognition
MarkTechPost@AI 2024-09-28T12:20:50.000000Z