"Evaluation Metrics" related articles
Driving scenario generation and evaluation using a structured layer representation and foundational models
cs.AI updates on arXiv.org
2025-11-05T05:30:38.000000Z
SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning
cs.AI updates on arXiv.org
2025-10-31T04:08:08.000000Z
The Quest for Reliable Metrics of Responsible AI
cs.AI updates on arXiv.org
2025-10-31T04:05:01.000000Z
ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation
cs.AI updates on arXiv.org
2025-10-28T04:14:40.000000Z
PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
cs.AI updates on arXiv.org
2025-10-23T04:15:20.000000Z
Invoice Information Extraction: Methods and Performance Evaluation
cs.AI updates on arXiv.org
2025-10-20T04:09:45.000000Z
On the Design and Evaluation of Human-centered Explainable AI Systems: A Systematic Review and Taxonomy
cs.AI updates on arXiv.org
2025-10-15T04:39:34.000000Z
What Is Your Agent's GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment
cs.AI updates on arXiv.org
2025-10-13T04:09:00.000000Z
ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models
cs.AI updates on arXiv.org
2025-10-08T04:07:25.000000Z
WebRenderBench: Enhancing Web Interface Generation through Layout-Style Consistency and Reinforcement Learning
cs.AI updates on arXiv.org
2025-10-07T04:07:46.000000Z
Reward Models are Metrics in a Trench Coat
cs.AI updates on arXiv.org
2025-10-06T04:28:23.000000Z
Detection of Chagas Disease from the ECG: The George B. Moody PhysioNet Challenge 2025
cs.AI updates on arXiv.org
2025-10-03T04:18:38.000000Z
FINCH: Financial Intelligence using Natural language for Contextualized SQL Handling
cs.AI updates on arXiv.org
2025-10-03T04:18:08.000000Z
Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks
cs.AI updates on arXiv.org
2025-10-03T04:17:58.000000Z
Mailbag: How to Bootstrap Labels for Relevant Docs in Search
https://eugeneyan.com/rss
2025-09-30T11:12:10.000000Z