AI Snake Oil, September 12
AI Code Generation: The Tradeoff Between Accuracy and Cost

This article examines how the accuracy of current AI code-generation systems is evaluated, arguing that accuracy-only evaluations that ignore cost have little practical value. The authors find that many complex AI agent systems, despite strong benchmark scores, cost far more to run than simple baselines while delivering only limited accuracy gains. The article argues that cost should be part of any evaluation of AI systems and proposes visualizing the accuracy-cost tradeoff with Pareto curves. It also highlights the lack of standardization and reproducibility in current agent evaluations and calls for more rigorous evaluation methods.

📊 **The cost-accuracy tradeoff matters:** Evaluating AI code-generation systems on accuracy alone is of limited use; real-world deployment has to account for running cost. Some complex AI agent systems reach high accuracy on benchmarks such as HumanEval, but they can cost orders of magnitude more than simple baselines, and it is questionable whether that extra cost buys commensurate value. Cost should therefore be a core metric, and the tradeoff between accuracy and cost should be visualized, for example with Pareto curves.

💡 **Simple baselines can be more cost-effective:** Seemingly sophisticated agent architectures do not significantly outperform simple baselines (such as retry, temperature-warming, or model-escalation strategies) on code-generation tasks, and in some cases cost more than 50 times as much. Popular agent architectures may therefore not be the best choice; simpler, cheaper strategies can offer a better balance between accuracy and cost, which suggests a different path for developers.

🔍 **Evaluation standardization and reproducibility need improvement:** Current AI agent evaluations suffer from serious standardization and reproducibility problems. The researchers had difficulty reproducing published results; studies differ in how they use test sets, which model versions they run, and other evaluation details, making results hard to compare and verify. This opacity fuels over-optimism about agent capabilities and hampers healthy research progress, so a common evaluation framework and standards are needed.

📈 **Proxy metrics such as parameter count are misleading:** Using parameter count as a proxy for cost when evaluating AI systems is misleading. Actual running cost, especially in downstream applications, is measured in dollars. Different model developers use different optimizations and benefit from different economies of scale, so parameter count can diverge substantially from real cost. For downstream developers, the key to informed choices is measuring dollar cost directly and weighing it against accuracy.

By Sayash Kapoor, Benedikt Stroebl, Arvind Narayanan

Which is the most accurate AI system for generating code? Surprisingly, there isn’t currently a good way to answer questions like these. 

Based on HumanEval, a widely used benchmark for code generation, the most accurate publicly available system is LDB (short for LLM debugger).1 But there’s a catch. The most accurate generative AI systems, including LDB, tend to be agents,2 which repeatedly invoke language models like GPT-4. That means they can be orders of magnitude more costly to run than the models themselves (which are already pretty costly). If we eke out a 2% accuracy improvement for 100x the cost, is that really better?

In this post, we argue that:

- AI agent evaluations must take cost into account, not just accuracy.
- Visualizing the tradeoff between accuracy and cost (for example, with Pareto curves) shows that simple, cheaper baselines can match far more expensive agent architectures.
- Proxies for cost, such as parameter count, are misleading for downstream evaluation.
- Agent evaluations currently lack standardization and reproducibility.

Maximizing accuracy can lead to unbounded cost

LLMs are stochastic. Simply calling a model many times and outputting the most common answer can increase accuracy. 
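For intuition (a generic illustration, not the procedure used by any particular agent discussed in this post), a repeated-sampling, majority-vote loop might look like the sketch below. Here `call_model` is a hypothetical wrapper around whatever sampling API you use, invoked at a temperature above zero so that repeated calls can disagree.

```python
from collections import Counter
from typing import Callable


def majority_vote(call_model: Callable[[str], str], prompt: str, n_samples: int = 10) -> str:
    """Sample the model several times and return the most common answer.

    `call_model` is a hypothetical hook around your model API; it should
    sample at temperature > 0 so that repeated calls produce different outputs.
    """
    answers = [call_model(prompt) for _ in range(n_samples)]
    # most_common(1) returns [(answer, count)] for the modal answer.
    return Counter(answers).most_common(1)[0][0]
```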

On some tasks, there is seemingly no limit to the amount of inference compute that can improve accuracy.3 Google DeepMind's AlphaCode, which improved accuracy on automated coding evaluations, showed that this trend holds even when calling LLMs millions of times.

The accuracy of AlphaCode on coding tasks continues to improve even after making a million calls to the underlying model (the different curves represent varying parameter counts). Accuracy is measured by how often one of the top 10 answers generated by the model is correct.

A useful evaluation of agents must therefore ask: What did it cost? If we don’t do cost-controlled comparisons, it will encourage researchers to develop extremely costly agents just to claim they topped the leaderboard.

In fact, when we evaluate agents that have been proposed in the last year for solving coding tasks, we find that visualizing the tradeoff between cost and accuracy yields surprising insights. 


Visualizing the accuracy-cost tradeoff on HumanEval, with new baselines

We re-evaluated the accuracy of three agents that have been claimed to occupy top spots on the HumanEval leaderboard: LDB, LATS, and Reflexion.4 We also evaluated the cost and time requirements of running these agents. 

These agents rely on running the code generated by the model, and if it fails the test cases provided with the problem description, they try to debug the code, look at alternative paths in the code generation process, or "reflect" on why the model's outputs were incorrect before generating another solution. 

In addition, we calculated the accuracy, cost, and running time of a few simple baselines:

- Calling the underlying models (GPT-3.5, GPT-4, and Llama-3) directly with the problem description.
- A retry strategy: repeatedly call the model if its solution fails the example tests provided with the problem description.
- A warming strategy: like retry, but increase the sampling temperature on each successive attempt.
- An escalation strategy: start with a cheap model and escalate to more expensive models when the example tests fail.

Surprisingly, we are not aware of any papers that compare their proposed agent architectures with any of the latter three simple baselines.
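To make these baselines concrete, here is a minimal sketch of an escalation-style baseline. The `generate` and `passes_example_tests` callables are hypothetical hooks (the latter runs only the example tests shipped with the problem description, never the held-out evaluation tests), and the cheap-to-expensive model ordering is illustrative rather than the exact configuration used in our experiments.

```python
from typing import Callable, Sequence


def escalation_baseline(
    generate: Callable[[str, str], str],          # (model_name, prompt) -> candidate code
    passes_example_tests: Callable[[str], bool],  # runs only the problem's example tests
    prompt: str,
    models: Sequence[str] = ("llama-3-8b", "gpt-3.5-turbo", "llama-3-70b", "gpt-4"),
) -> str:
    """Try progressively more expensive models and return the first candidate
    solution that passes the example tests from the problem description."""
    candidate = ""
    for model in models:
        candidate = generate(model, prompt)
        if passes_example_tests(candidate):
            return candidate
    return candidate  # fall back to the most expensive model's attempt
```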

Our most striking result is that agent architectures for HumanEval do not outperform our simpler baselines despite costing more. In fact, agents differ drastically in terms of cost: for substantially similar accuracy, the cost can differ by almost two orders of magnitude!8 Yet, the cost of running these agents isn't a top-line metric reported in any of these papers.9

Our simple baselines offer Pareto improvements over existing agent architectures. We run each agent five times and report the mean accuracy and the mean total cost on the 164 HumanEval problems. Where results for LDB have two models/agents in parentheses, they indicate the language model or agent used to generate the code, followed by the language model used to debug the code. Where they have just one, they indicate that the same model was used to both generate the code and debug it. Note that the y-axis is shown from 0.7 to 1; figures with the full axis (0 to 1) and error bars, robustness checks, and other details about our empirical results are included in the appendix.

There is no significant accuracy difference between the warming strategy and the best-performing agent architecture. Yet, Reflexion and LDB cost over 50% more than the warming strategy,10 and LATS over 50 times more (all these costs are entirely or predominantly from calls to GPT-4, so these ratios will be stable even if model costs change).  Meanwhile, the escalation strategy strictly improves accuracy while costing less than half of LDB (GPT-3.5) at current inference prices.11 
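To make "Pareto improvement" concrete: a system is on the cost-accuracy frontier if no other system is at least as cheap and at least as accurate, and strictly better on one of the two. A minimal sketch of that check is below; the numbers in the example are placeholders, not the measured results from the figure above.

```python
from typing import Dict, List, Tuple


def pareto_frontier(results: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the systems that are not dominated on (cost, accuracy).

    `results` maps a system name to (mean_total_cost_usd, mean_accuracy).
    A system is dominated if some other system is at least as cheap and at
    least as accurate, and strictly better on one of the two.
    """
    frontier = []
    for name, (cost, acc) in results.items():
        dominated = any(
            o_cost <= cost and o_acc >= acc and (o_cost < cost or o_acc > acc)
            for other, (o_cost, o_acc) in results.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Illustrative numbers only: the expensive agent is dominated by the cheaper one.
print(pareto_frontier({
    "agent_a": (25.0, 0.94),
    "agent_b": (1.2, 0.95),
    "baseline": (0.8, 0.93),
}))
```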

Our results point to another underlying problem: papers making claims about the usefulness of agents have so far failed to test if simple agent baselines can lead to similar accuracy. This has led to widespread beliefs among AI researchers that complex ideas like planning, reflection, and debugging are responsible for accuracy gains. In fact, Lipton and Steinhardt noted a trend in the AI literature of failing to identify the sources of empirical gains back in 2018.

Based on our findings, the question of whether debugging, reflection, and other such “System 2” approaches are useful for code generation remains open.12 It is possible that they will be useful on harder programming tasks than those represented in HumanEval. For now, the over-optimism about System 2 approaches is exacerbated by a lack of reproducibility and standardization that we report below.13

Proxies for cost are misleading

At first glance, reporting dollar costs is jarring. It breaks many properties of benchmarking that we take for granted: that measurements don’t change over time (whereas costs tend to come down) and that different models compete on a level playing field (whereas some developers may benefit from economies of scale, leading to lower inference costs). Because of this, researchers usually pick a different axis for the Pareto curve, such as parameter count.

The downsides of reporting costs are real, but we describe below how they can be mitigated. More importantly, we think using attributes like parameter count as a proxy for cost is a mistake and doesn’t solve the problem it’s intended to solve. To understand why, we need to introduce a conceptual distinction.

AI evaluations serve at least two distinct purposes. Model developers and AI researchers use them to identify which changes to the training data and architecture improve accuracy. We call this model evaluation. And downstream developers, such as programmers who use AI to build consumer-facing products, use evaluations to decide which AI systems to use in their products. We call this downstream evaluation.

The difference between model evaluation and downstream evaluation is underappreciated. This has led to much confusion about how to factor in the cost of running AI.

Model evaluation is a scientific question of interest to researchers. So it makes sense to stay away from dollar costs for the aforementioned reasons. Instead, controlling for compute is a reasonable approach: if we normalize the amount of compute used to train a model, we can then understand if factors like architectural changes or changes in the data composition are responsible for improvements, as opposed to more compute. Notably, Nathan Lambert argues that many of the accuracy gains in the last year (such as Meta's Llama 2) are simply consequences of using more compute. 

On the other hand, downstream evaluation is an engineering question that helps inform a procurement decision. Here, cost is the actual construct of interest. The downsides of cost measurement aren’t downsides at all; they are exactly what’s needed. Inference costs do come down over time, and that greatly matters to downstream developers. It is unnecessary and counterproductive for the evaluation to stay frozen in time.

In this context, proxies for cost (such as the number of active parameters or amount of compute used) are misleading. For example, Mistral released the figure below alongside their latest model, Mixtral 8x22B, to explain why developers should choose it over competitors. 

Substituting active parameters as a proxy for cost is misleading. Source: Mistral.

In this figure, the number of active parameters is a poor proxy for cost. On Anyscale, Mixtral 8x7B costs twice as much as Llama 2 13B, yet Mistral's figure shows it costs about the same, because they only consider the number of active parameters. Of course, downstream developers don't care about the number of active parameters when they're using an API. They simply care about the dollar cost relative to accuracy. Mistral chose “active parameters” as a proxy, presumably because it makes their models look better than dense models such as Meta’s Llama and Cohere’s Command R+. If we start using proxies for cost, every model developer can pick a proxy that makes their model look good.

Some hurdles to cost evaluation remain. Different providers can charge different amounts for the same model, the cost of an API call might change overnight, and cost might vary based on model developer decisions, such as whether bulk API calls are charged differently. These downsides can be partly addressed by making the evaluation results customizable using mechanisms to adjust the cost of running models, i.e., providing users the option to adjust the cost of input and output tokens for their provider of choice to recalculate the tradeoff between cost and accuracy. In turn, downstream evaluations of agents should include input/output token counts in addition to dollar costs, so that anyone looking at the evaluation in the future can instantly recalculate the cost using current prices.
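As a sketch of that mechanism: once input and output token counts are logged, recomputing the dollar cost of a run under a reader's own prices is a one-line calculation. The token counts and per-million-token prices below are placeholders.

```python
def recalculate_cost(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Recompute the dollar cost of a logged run from its token counts.

    Prices are quoted per million tokens, matching how most API providers
    list them; plug in current prices to refresh the cost-accuracy tradeoff
    without rerunning any experiments.
    """
    return (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    ) / 1_000_000


# Example with made-up token counts and prices.
print(recalculate_cost(1_200_000, 250_000, 0.50, 1.50))
```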

But ultimately, despite the hurdles, good measurement requires modeling the underlying construct of interest. For downstream evaluations, that underlying construct is cost. All other proxies are lacking.

Agent evaluations lack standardization and reproducibility

In the course of our evaluation, we found many shortcomings in the reproducibility and standardization of agent evaluations.

These shortcomings in the empirical results have also led to errors of interpretation in broader discussions around the accuracy of AI agents. For example, a recent post by Andrew Ng claimed that agents that use GPT-3.5 can outperform GPT-4. In particular, he claimed: 

[For HumanEval,] GPT-3.5 (zero shot) was 48.1% correct. GPT-4 (zero shot) does better at 67.0%. However, the improvement from GPT-3.5 to GPT-4 is dwarfed by incorporating an iterative agent workflow. Indeed, wrapped in an agent loop, GPT-3.5 achieves up to 95.1%. 

While this claim received a lot of attention, it is incorrect. The claim ("GPT-3.5 wrapped in an agent workflow achieves 95.1% accuracy") seems to be about the LDB agent. The Papers With Code leaderboard for HumanEval makes the same claim. However, as we discussed above, for LDB, GPT-3.5 is only used to find bugs. The code is generated using GPT-4 (or the Reflexion agent that uses GPT-4), not GPT-3.5. Unfortunately, the error in the paper has led to much overoptimism about agents in the broader AI community.

Ng's post also makes the familiar error of repeating results from papers without verifying them or accounting for changes in prompts and model versions. For example, the zero-shot accuracy numbers of GPT-3.5 (48.1%) and GPT-4 (67.0%) seem to be copied from the GPT-4 technical report from March 2023. However, the models have been updated many times since release. Indeed, in our comparison, we find that the base models perform much better compared to the claimed figures in Ng's post when we use them with the prompts provided with the LDB paper (GPT-3.5: 73.9%, GPT-4: 89.6%). As a result, the post drastically overestimates the improvement attributable to agent architectures.

Evaluation frameworks like Stanford's HELM and EleutherAI's LM Evaluation Harness attempt to fix similar shortcomings for model evaluations, by providing standardized evaluation results. We are working on solutions to make agent evaluations standardized and reproducible, especially from the perspective of downstream evaluation of agents.

Finally, downstream developers should keep in mind that HumanEval or any other standardized benchmark is nothing more than a rough proxy for the specific tasks that arise in a particular downstream application. To understand how agents will perform in practice, it is necessary to evaluate them on a custom dataset from the domain of interest — or even better, A/B test different agents in the production environment.
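A minimal sketch of such a downstream comparison is below, assuming each agent is wrapped as a callable that reports whether it solved a task and what the attempt cost in dollars; both hooks are hypothetical stand-ins for your own harness or A/B framework.

```python
from typing import Callable, Dict, List, Tuple


def compare_agents(
    agents: Dict[str, Callable[[str], Tuple[bool, float]]],
    tasks: List[str],
) -> Dict[str, Tuple[float, float]]:
    """Run each agent on your own tasks and report (accuracy, total dollar cost).

    Each agent is a hypothetical callable: task -> (solved, cost_usd)."""
    summary = {}
    for name, run in agents.items():
        outcomes = [run(task) for task in tasks]
        accuracy = sum(solved for solved, _ in outcomes) / len(tasks)
        total_cost = sum(cost for _, cost in outcomes)
        summary[name] = (accuracy, total_cost)
    return summary
```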

Further reading

The code for reproducing our analysis is available here. The appendix includes more details about our setup and results.

Acknowledgments

We thank Rishi Bommasani, Rumman Chowdhury, Percy Liang, Shayne Longpre, Yifan Mai, Nitya Nadgir, Matt Salganik, Hailey Schoelkopf, Zachary Siegel, and Venia Veselovsky for discussions and inputs that informed our analysis. We acknowledge Cunxiang Wang and Ruoxi Ning for their prompt responses to our questions about the NovelQA benchmark. 

We are grateful to the authors of the papers we engage with in this post for their quick responses and for sharing their code, which makes such reproduction analysis possible in the first place. In particular, we are grateful to Zilong Wang (LDB), Andy Zhou (LATS), and Karthik Narasimhan (Reflexion), who gave us feedback in response to an earlier draft of this blog post.

1. The leaderboard on the linked page lists AgentCoder as the most accurate system. However, the code or data for reproducing the results of this agent are not available online, so we do not consider it in this blog post.

2. This post is about agents. Leaderboards are also becoming less useful for evaluating the underlying models. There are many problems, including gameability. But controlling for inference cost isn't the main problem, so our arguments don't necessarily apply.

3. Tasks where increased compute could help indefinitely are primarily those where verifying whether a solution is correct is easy. In the case of programming questions, this takes the form of test cases that are provided with each question to check if the answer is correct. Other examples include proving theorems, because verifying whether a theorem is correct can be straightforward, as well as some tasks on the internet for agents that navigate the web. That said, even for tasks where there is no way to guess a solution and then verify it, the costs of different agents can vary by orders of magnitude.

4. We included agents from the HumanEval leaderboard on Papers With Code that share their code publicly. Reflexion is absent from the Papers With Code list, but it has a reported accuracy of 91% (higher than any other agent with publicly available code apart from LDB and LATS), so we included it too.

5. For the model evaluation, we only used the description of the coding problem as well as the example tests provided with the HumanEval dataset. Three of the 164 coding problems in HumanEval lack example tests. The authors of LDB include a modified version of HumanEval with example tests included for these three problems. We use this modified version for all experiments.

6. In all of the baselines we provide, we don't use the test cases used to evaluate whether the solution is correct when deciding to retry, only the ones in the problem description, to avoid leakage.

7. We evaluated Llama-3 using together.ai endpoints. The cost per million tokens on together.ai, for both prompt and completion, is $0.20 and $0.90 for Llama-3-8B and Llama-3-70B, respectively.

8. This is also true for other desired properties of agents, such as running time. We report results for time vs. accuracy tradeoffs in the appendix.

9. While some of the papers introducing these agents discuss cost abstractly, such as the relationship between cost and the number of times an agent retries, they don't report any concrete numbers on cost or compare token counts to a baseline.

10. The cost comparison is for LDB (Reflexion, GPT-3.5), since that is the top-performing agent reported by the authors of LDB.

11. In addition to HumanEval, we also ran experiments on the HotPotQA and NovelQA benchmarks for question answering. We found similar results for both benchmarks. In particular, we found that there can be large differences in cost underlying small improvements in accuracy for both benchmarks.

12. One potential concern with our analysis is that while we relied on the April 2024 version of OpenAI models, many papers relied on older model versions for their results. To address this, we report results for an additional robustness check with the June 2023 version of OpenAI models in the appendix; we find substantially similar results across model versions.

13. While HumanEval is commonly used to evaluate how well AI can solve coding problems, it is limited due to its small size (only 164 questions), lack of difficult problems (none of the problems involve real-world tasks), and potential contamination, since language models have likely been trained on HumanEval problems, which might inflate the performance of the simple baselines we test. A more rigorous examination of hypotheses related to whether System 2 thinking helps will likely require more comprehensive and robust benchmarks, such as SWE-bench.

14. LDB takes already-existing solutions and improves them by debugging. The existing solutions can come from models like GPT-3.5 or GPT-4, or from agents like Reflexion. Since the authors of Reflexion provided all of the generated solutions in their GitHub repo, the authors of the LDB paper used code from the original Reflexion repository to run their analysis, rather than rerunning the Reflexion agent. The difference between the reported results and our reproduced results could be due to differences in the code generated by the Reflexion agent. Reusing Reflexion solutions is a reasonable choice for evaluating the usefulness of debugging (indeed, we see LDB increases accuracy over using the models alone). The problem arises when the final accuracy is interpreted as a downstream evaluation, since it might give developers an inflated estimate of the accuracy of such techniques for coding.

15. The authors acknowledge this and plan to update their results.

16. The authors of LDB only tested the GPT-3.5 model as the debugger, which performed notably worse than the agent using GPT-4 as the debugger, with an accuracy of 88.9% for LDB (GPT-3.5 + Reflexion) vs. 92.9% for LDB (GPT-4 + Reflexion).

17. In correspondence with the authors of LATS, they clarified: "Originally, there was an execution error when evaluating some test cases for [one of the HumanEval test cases], so we opted to remove it from our setting."

18. The authors acknowledge this and plan to update the paper to address it.

19. The authors acknowledge this and plan to update the paper to address it.
