The limitations of AI benchmarks may mislead enterprise decision-making

A new academic review finds that today's mainstream artificial intelligence (AI) benchmarks contain systematic flaws and may lead enterprises to make high-stakes decisions based on “misleading” data. Many enterprises are investing heavily in generative AI projects, and those decisions often rely on public leaderboards and benchmarks to assess model capabilities. However, a study of 445 large language model (LLM) benchmarks found that almost all of them have measurement weaknesses that undermine the reliability of claims about model performance. The flaws include vague definitions, a lack of statistical rigour, data contamination, and unrepresentative datasets. The study stresses that enterprises should not rely on generic AI benchmarks, but should instead build internal, domain-specific evaluation frameworks to ensure their AI investments are effective and safe.

📊 **AI benchmarks have a widespread “construct validity” problem**: The study finds that many AI benchmarks fail to measure the concepts they claim to measure, such as “safety” or “robustness”. This can lead enterprises to select models that do not actually meet business needs, exposing them to financial and reputational risk.

📉 **Evaluation methods lack scientific rigour**: Nearly 80 percent of benchmarks provide no uncertainty estimates or statistical tests, making it impossible to tell whether small performance gaps between models reflect genuine capability or random variation. Data contamination (benchmark questions appearing in a model's pre-training data) and unrepresentative datasets (for example, outdated or oversimplified data) further weaken the validity of the results.

💡 **Enterprises need internal, domain-specific AI evaluation frameworks**: Given the limitations of public benchmarks, enterprises should stop blindly trusting generic leaderboards and instead build internal evaluations that reflect real business scenarios, data, and challenges. This includes precisely defining the phenomenon to be measured, constructing representative datasets, conducting detailed error analysis, and demonstrating that the evaluation method correlates with real business value.

A new academic review suggests AI benchmarks are flawed, potentially leading enterprises to make high-stakes decisions based on “misleading” data.

Enterprise leaders are committing budgets of eight or nine figures to generative AI programmes. These procurement and development decisions often rely on public leaderboards and benchmarks to compare model capabilities.

A large-scale study, ‘Measuring what Matters: Construct Validity in Large Language Model Benchmarks,’ analysed 445 separate LLM benchmarks from leading AI conferences. A team of 29 expert reviewers found that “almost all articles have weaknesses in at least one area,” undermining the claims they make about model performance.

For CTOs and Chief Data Officers, the finding strikes at the heart of AI governance and investment strategy. If a benchmark claiming to measure ‘safety’ or ‘robustness’ doesn’t actually capture those qualities, an organisation could deploy a model that exposes it to serious financial and reputational risk.

The ‘construct validity’ problem

The researchers focused on a core scientific principle known as construct validity. In simple terms, this is the degree to which a test measures the abstract concept it claims to be measuring.

For example, while ‘intelligence’ cannot be measured directly, tests are created to serve as measurable proxies. The paper notes that if a benchmark has low construct validity, “then a high score may be irrelevant or even misleading”.

This problem is widespread in AI evaluation. The study found that key concepts are often “poorly defined or operationalised”. This can lead to “poorly supported scientific claims, misdirected research, and policy implications that are not grounded in robust evidence”.

When vendors compete for enterprise contracts by highlighting their top scores on benchmarks, leaders are effectively trusting that these scores are a reliable proxy for real-world business performance. This new research suggests that trust may be misplaced.

Where enterprise AI benchmarks are failing

The review identified systemic failings across the board, from how benchmarks are designed to how their results are reported.

Vague or contested definitions: You cannot measure what you cannot define. The study found that even when definitions for a phenomenon were provided, 47.8 percent were “contested,” addressing concepts with “many possible definitions or no clear definition at all”.

The paper uses ‘harmlessness’ – a key goal in enterprise safety alignment – as an example of a phenomenon that often lacks a clear, agreed-upon definition. If two vendors score differently on a ‘harmlessness’ benchmark, it may only reflect two different, arbitrary definitions of the term, not a genuine difference in model safety.

Lack of statistical rigour: Perhaps most alarming for data-driven organisations, the review found that only 16 percent of the 445 benchmarks used uncertainty estimates or statistical tests to compare model results.

Without statistical analysis, it’s impossible to know if a 2 percent lead for Model A over Model B is a genuine capability difference or simple random chance. Enterprise decisions are being guided by numbers that would not pass a basic scientific or business intelligence review.
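
To make that concrete, below is a minimal sketch (not drawn from the paper) of how a team could test whether a small accuracy gap is distinguishable from noise, using a paired bootstrap over the benchmark items. The per-item correctness arrays and the simulated 82 percent versus 80 percent scores are illustrative assumptions.

```python
# A minimal sketch of a paired bootstrap over benchmark items. Inputs are
# per-question correctness arrays for two models on the same benchmark
# (illustrative assumption: boolean arrays of equal length).
import numpy as np

def paired_bootstrap_gap(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Return the observed accuracy gap and a 95% bootstrap confidence interval."""
    correct_a = np.asarray(correct_a, dtype=float)
    correct_b = np.asarray(correct_b, dtype=float)
    n = len(correct_a)
    rng = np.random.default_rng(seed)

    observed_gap = correct_a.mean() - correct_b.mean()

    # Resample benchmark items with replacement, keeping the per-item pairing intact.
    idx = rng.integers(0, n, size=(n_resamples, n))
    gaps = correct_a[idx].mean(axis=1) - correct_b[idx].mean(axis=1)
    low, high = np.percentile(gaps, [2.5, 97.5])
    return observed_gap, (low, high)

# Illustrative example: a roughly 2-point lead on a 500-item benchmark.
rng = np.random.default_rng(1)
a = rng.random(500) < 0.82   # Model A, ~82% accuracy
b = rng.random(500) < 0.80   # Model B, ~80% accuracy
gap, (lo, hi) = paired_bootstrap_gap(a, b)
print(f"gap={gap:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
# If the interval spans zero, the "lead" is indistinguishable from random chance.
```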

Data contamination and memorisation: Many benchmarks, especially those for reasoning (like the widely used GSM8K), are undermined when their questions and answers appear in the model’s pre-training data.

When this happens, the model isn’t reasoning to find the answer; it’s simply memorising it. A high score may indicate a good memory, not the advanced reasoning capability an enterprise actually needs for a complex task. The paper warns this “undermine[s] the validity of the results” and recommends building contamination checks directly into the benchmark.
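
The paper does not prescribe a specific technique here, but a common approach is n-gram overlap screening. The sketch below, which assumes access to a sample of the pre-training corpus, flags benchmark items whose word-level 13-grams also appear in that sample; the window size and helper names are assumptions for illustration.

```python
# A minimal sketch of an n-gram overlap contamination check. It assumes you can
# stream a sample of the pre-training corpus; the 13-gram window is a common
# heuristic for detecting near-verbatim overlap, not the paper's method.
from typing import Iterable, List, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(benchmark_items: Iterable[str],
                      corpus_docs: Iterable[str],
                      n: int = 13) -> List[int]:
    """Return indices of benchmark items whose n-grams appear in the corpus sample."""
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)

    flagged = []
    for i, item in enumerate(benchmark_items):
        if ngrams(item, n) & corpus_grams:
            flagged.append(i)
    return flagged

# Usage: drop or report flagged items before quoting a benchmark score.
# flagged = flag_contaminated(test_questions, sampled_pretraining_docs)
```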

Unrepresentative datasets: The study found that 27 percent of benchmarks used “convenience sampling,” such as reusing data from existing benchmarks or human exams. This data is often not representative of the real-world phenomenon.

For example, the authors note that reusing questions from a “calculator-free exam” means the problems use numbers chosen to be easy for basic arithmetic. A model might score well on this test, but this score “would not predict performance on larger numbers, where LLMs struggle”. This creates a critical blind spot, hiding a known model weakness.
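
One way an internal benchmark can expose this blind spot is to stratify arithmetic items by operand size and report accuracy per stratum rather than as a single blended number. The strata, item format, and helper functions in the sketch below are illustrative assumptions, not taken from the study.

```python
# A minimal sketch of stratifying arithmetic items by operand magnitude so a
# score on exam-sized numbers cannot hide weakness on larger ones. The strata
# and item format are illustrative assumptions.
import random

STRATA = {
    "2-digit": (10, 99),
    "5-digit": (10_000, 99_999),
    "9-digit": (100_000_000, 999_999_999),
}

def make_items(items_per_stratum: int = 50, seed: int = 0) -> list:
    """Generate multiplication items, labelled by stratum, with ground truth answers."""
    rng = random.Random(seed)
    items = []
    for name, (lo, hi) in STRATA.items():
        for _ in range(items_per_stratum):
            x, y = rng.randint(lo, hi), rng.randint(lo, hi)
            items.append({"stratum": name,
                          "prompt": f"What is {x} * {y}?",
                          "answer": str(x * y)})
    return items

def accuracy_by_stratum(items, predictions) -> dict:
    """Report accuracy per stratum instead of one blended number."""
    totals, correct = {}, {}
    for item, pred in zip(items, predictions):
        s = item["stratum"]
        totals[s] = totals.get(s, 0) + 1
        correct[s] = correct.get(s, 0) + (pred.strip() == item["answer"])
    return {s: correct[s] / totals[s] for s in totals}
```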

From public metrics to internal validation

For enterprise leaders, the study serves as a strong warning: public AI benchmarks are not a substitute for internal and domain-specific evaluation. A high score on a public leaderboard is not a guarantee of fitness for a specific business purpose.

Isabella Grandi, Director for Data Strategy & Governance at NTT DATA UK&I, commented: “A single benchmark might not be the right way to capture the complexity of AI systems, and expecting it to do so risks reducing progress to a numbers game rather than a measure of real-world responsibility. What matters most is consistent evaluation against clear principles that ensure technology serves people as well as progress.

“Good methodology – as laid out by ISO/IEC 42001:2023 – reflects this balance through five core principles: accountability, fairness, transparency, security and redress. Accountability establishes ownership and responsibility for any AI system that is deployed. Transparency and fairness guide decisions toward outcomes that are ethical and explainable. Security and privacy are non-negotiable, preventing misuse and reinforcing public trust. Redress and contestability provide a vital mechanism for oversight, ensuring people can challenge and correct outcomes when necessary.

“Real progress in AI depends on collaboration that brings together the vision of government, the curiosity of academia and the practical drive of industry. When partnerships are underpinned by open dialogue and shared standards take hold, it builds the transparency needed for people to instil trust in AI systems. Responsible innovation will always rely on cooperation that strengthens oversight while keeping ambition alive.”

The paper’s eight recommendations provide a practical checklist for any enterprise looking to build its own internal AI benchmarks and evaluations, aligning with the principles-based approach.
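
As a rough illustration of what such an internal evaluation might look like in code, the skeleton below scores a model on a domain-specific dataset, attaches a first-pass uncertainty estimate to the headline number, and logs every imperfect output for later error analysis. The data structures, function names, and scoring interface are assumptions for this sketch, not the paper's recommendations verbatim.

```python
# An illustrative skeleton for an internal, domain-specific evaluation: a precise
# task definition, a representative dataset, per-item scoring, an uncertainty
# estimate, and a saved error log for analysis. Names are assumptions.
import json
import statistics
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalItem:
    prompt: str       # drawn from real business traffic, not a public exam
    reference: str    # the outcome your domain experts consider correct

def run_eval(model: Callable[[str], str],
             items: List[EvalItem],
             score: Callable[[str, str], float],
             error_log: str = "errors.jsonl") -> dict:
    """Score a model on an internal dataset and keep failures for error analysis."""
    scores = []
    with open(error_log, "w", encoding="utf-8") as log:
        for item in items:
            output = model(item.prompt)
            s = score(output, item.reference)
            scores.append(s)
            if s < 1.0:  # keep every imperfect case for human review
                log.write(json.dumps({"prompt": item.prompt,
                                      "output": output,
                                      "reference": item.reference,
                                      "score": s}) + "\n")
    mean = statistics.mean(scores)
    # Standard error gives a first-pass uncertainty estimate for the headline number.
    sem = statistics.stdev(scores) / len(scores) ** 0.5 if len(scores) > 1 else 0.0
    return {"mean_score": mean, "std_error": sem, "n_items": len(scores)}
```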

The race to deploy generative AI is pushing organisations to move faster than their governance frameworks can keep up. This report shows that the very tools used to measure progress are often flawed. The only reliable path forward is to stop trusting generic AI benchmarks and start “measuring what matters” for your own enterprise.

See also: OpenAI spreads $600B cloud AI bet across AWS, Oracle, Microsoft
