AI News, September 26
Samsung Releases TRUEBench to Evaluate the Real-World Productivity of Enterprise AI Models

Samsung Research has developed a new benchmark called TRUEBench, designed to close the gap between existing AI model evaluation methods and the practical needs of enterprises. The system focuses on measuring how large language models (LLMs) perform in real-world production environments, rather than on academic or general-knowledge tests alone. TRUEBench evaluates common enterprise functions such as content creation, data analysis, document summarisation, and translation, covers 2,485 diverse test sets across 12 languages, and accounts for implicit user intent. Through human-AI collaborative scoring criteria and a strict all-or-nothing scoring model, TRUEBench delivers a more granular and reliable assessment of enterprise AI model productivity. Samsung has published the data and leaderboards on Hugging Face to promote industry transparency and better-informed AI model selection.

🏢 **Filling the enterprise AI evaluation gap**: Samsung's TRUEBench is designed specifically to evaluate the real-world productivity of enterprise-grade AI models, particularly large language models. It addresses the shortcomings of existing benchmarks, which focus largely on academic or general knowledge and are often limited to a single language, giving enterprises a reliable way to measure an AI's practical utility on complex, multilingual, context-rich business tasks.

📊 **Comprehensive coverage of core enterprise functions**: The TRUEBench framework evaluates the AI functions enterprises use most, including content generation, data analysis, long-document summarisation, and translation. These are broken down into 10 main categories and 46 sub-categories, enabling a highly granular view of an AI's productivity capabilities and ensuring both breadth and depth of evaluation.

🌐 **Multilingual support and complex scenarios**: The benchmark is built on 2,485 diverse test sets spanning 12 languages, with support for cross-lingual scenarios. Test materials range from brief instructions of just 8 characters to complex document analyses exceeding 20,000 characters, mirroring real workplace demands, with particular attention to an AI's ability to understand and satisfy implicit enterprise needs.

🤝 **A novel human-AI collaborative scoring mechanism**: To keep the evaluation standards precise and realistic, TRUEBench uses a distinctive human-AI collaborative process. Human experts first draft the evaluation criteria, an AI then reviews them for potential errors or unrealistic constraints, and the experts refine the criteria based on that feedback. This iterative process ensures accurate final scoring standards, while AI-automated scoring minimises subjective bias and keeps evaluations consistent and reliable.

Samsung is overcoming limitations of existing benchmarks to better assess the real-world productivity of AI models in enterprise settings. The new system, developed by Samsung Research and named TRUEBench, aims to address the growing disparity between theoretical AI performance and its actual utility in the workplace.

As businesses worldwide accelerate their adoption of large language models (LLMs) to improve their operations, a challenge has emerged: how to accurately gauge their effectiveness. Many existing benchmarks focus on academic or general-knowledge tests, often limited to English and simple question-and-answer formats. This has created a gap that leaves enterprises without a reliable method for evaluating how an AI model will perform on complex, multilingual, and context-rich business tasks.

Samsung’s TRUEBench, short for Trustworthy Real-world Usage Evaluation Benchmark, has been developed to fill this void. It provides a comprehensive suite of metrics that assesses LLMs based on scenarios and tasks directly relevant to real-world corporate environments. The benchmark draws upon Samsung’s own extensive internal enterprise use of AI models, ensuring the evaluation criteria are grounded in genuine workplace demands.

The framework evaluates common enterprise functions such as creating content, analysing data, summarising lengthy documents, and translating materials. These are broken down into 10 distinct categories and 46 sub-categories, providing a granular view of an AI’s productivity capabilities.

“Samsung Research brings deep expertise and a competitive edge through its real-world AI experience,” said Paul (Kyungwhoon) Cheun, CTO of the DX Division at Samsung Electronics and Head of Samsung Research. “We expect TRUEBench to establish evaluation standards for productivity.”

To tackle the limitations of older benchmarks, TRUEBench is built upon a foundation of 2,485 diverse test sets spanning 12 different languages and supporting cross-linguistic scenarios. This multilingual approach is critical for global corporations where information flows across different regions. The test materials themselves reflect the variety of workplace requests, ranging from brief instructions of just eight characters to the complex analysis of documents exceeding 20,000 characters.
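To make the composition of the benchmark concrete, here is a minimal sketch of what a single TRUEBench test item might look like, based only on the properties described in this article. The field names and types are hypothetical; the actual schema is defined by Samsung's published data.

```python
from dataclasses import dataclass, field

# Hypothetical record structure for one of the 2,485 test items.
# Field names are illustrative, not Samsung's actual schema.
@dataclass
class TrueBenchItem:
    item_id: str
    language: str            # one of the 12 supported languages
    category: str            # one of the 10 top-level categories
    sub_category: str        # one of the 46 sub-categories
    prompt: str              # from ~8 characters up to 20,000+ characters
    conditions: list[str] = field(default_factory=list)  # pass/fail conditions
```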

Samsung recognised that in a real business context, a user’s full intent is not always explicitly stated in their initial prompt. The benchmark is therefore designed to assess an AI model’s ability to understand and fulfil these implicit enterprise needs, moving beyond simple accuracy to a more nuanced measure of helpfulness and relevance.

To achieve this, Samsung Research developed a unique collaborative process between human experts and AI to create the productivity scoring criteria. Initially, human annotators establish the evaluation standards for a given task. An AI then reviews these standards, checking for potential errors, internal contradictions, or unnecessary constraints that might not reflect a realistic user expectation. Following the AI’s feedback, the human annotators refine the criteria. This iterative loop ensures the final evaluation standards are precise and reflective of a high-quality outcome.
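That three-step loop can be sketched in code. Everything below is a conceptual illustration: Samsung has described the process, not an implementation, so every function body here is a hypothetical stand-in.

```python
def human_draft_criteria(task: str) -> list[str]:
    # Step 1 (human): annotators draft the initial evaluation standards.
    return [f"response fully addresses the task: {task}",
            "response contains no factual errors"]

def ai_review(criteria: list[str]) -> list[str]:
    # Step 2 (AI): an LLM checks the draft for errors, internal
    # contradictions, or constraints no realistic user would impose.
    # Stubbed to return no issues; in practice this would call a model.
    return []

def human_refine(criteria: list[str], feedback: list[str]) -> list[str]:
    # Step 3 (human): annotators revise the criteria using the AI feedback.
    return criteria + [f"revised: {f}" for f in feedback]

def build_scoring_criteria(task: str, max_rounds: int = 3) -> list[str]:
    criteria = human_draft_criteria(task)
    for _ in range(max_rounds):
        feedback = ai_review(criteria)
        if not feedback:   # iteration stops once the AI finds nothing to flag
            break
        criteria = human_refine(criteria, feedback)
    return criteria

print(build_scoring_criteria("summarise a 20,000-character report"))
```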

This cross-verified process delivers an automated evaluation system that scores the performance of LLMs. By using AI to apply these refined criteria, the system minimises the subjective bias that can occur with human-only scoring, ensuring consistency and reliability across all tests. TRUEBench also employs a strict scoring model in which an AI model must satisfy every condition associated with a test to receive a passing mark. This all-or-nothing approach to individual conditions enables a more detailed and exacting assessment of AI model performance across different enterprise tasks.
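As a rough illustration of that rule, the sketch below scores a response by requiring every condition to hold; a single failure fails the whole test item. The simple keyword check is a hypothetical stand-in for TRUEBench's AI judge applying the refined criteria.

```python
def check_condition(response: str, condition: str) -> bool:
    # Hypothetical stand-in for the AI judge evaluating one condition.
    return condition.lower() in response.lower()

def passes_test(response: str, conditions: list[str]) -> bool:
    # All-or-nothing: the response passes only if every condition is met.
    return all(check_condition(response, c) for c in conditions)

conditions = ["summary", "revenue", "q3"]
print(passes_test("Q3 summary: revenue rose 12%.", conditions))  # True
print(passes_test("An unrelated poem.", conditions))             # False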

To boost transparency and encourage wider adoption, Samsung has made TRUEBench’s data samples and leaderboards publicly available on the global open-source platform Hugging Face. This allows developers, researchers, and enterprises to directly compare the productivity performance of up to five different AI models simultaneously. The platform provides a clear, at-a-glance overview of how various AIs stack up against each other on practical tasks.
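For readers who want to explore the published data, the snippet below shows one plausible way to pull it with the Hugging Face `datasets` library. The repository ID and split name used here are assumptions for illustration; check Hugging Face for the actual TRUEBench repository.

```python
from datasets import load_dataset

# "Samsung/TRUEBench" is a hypothetical repository ID, and "test" an
# assumed split name, used only to illustrate the access pattern;
# look up the real repository on Hugging Face.
truebench = load_dataset("Samsung/TRUEBench", split="test")
print(truebench[0])          # inspect one published data sample
print(truebench.num_rows)    # how many samples were released
```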

At the time of writing, the leaderboard shows the top 20 models by overall ranking on Samsung’s AI benchmark.

The full published data also includes the average length of the AI-generated responses. This allows for a simultaneous comparison of not only performance but also efficiency, a key consideration for businesses weighing operational costs and speed.

With the launch of TRUEBench, Samsung is not merely releasing another tool but aiming to change how the industry thinks about AI performance. By shifting the measure of success from abstract knowledge to tangible productivity, Samsung’s benchmark could help organisations make better decisions about which enterprise AI models to integrate into their workflows, bridging the gap between an AI’s potential and its proven value.

