AI Snake Oil 09月25日
AI代码生成器准确性与成本分析
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了当前最准确公开AI代码生成系统(如LDB)的成本问题,指出尽管这些系统(通常基于GPT-4等语言模型)在准确性上领先,但其运行成本远高于基础模型,导致性价比低下。文章通过对比LDB、LATS、Reflexion等复杂代理架构与简单基线(如GPT-3.5、GPT-4的Retry/Warming策略)发现,在相似准确率下,代理架构成本可高出近两个数量级。研究强调应将成本纳入评估标准,并建议使用实际美元成本而非参数数量等代理指标进行下游评估,同时呼吁建立标准化的代理评估体系以提高可复现性。

🔍 研究发现,尽管LDB、LATS等复杂代理架构在HumanEval基准上表现优异,但其运行成本远超简单基线策略(如GPT-3.5的Retry或Warming策略),在相似准确率下成本差异可达近两个数量级,表明当前复杂代理架构的性价比亟待优化。

💰 文章指出,将成本纳入AI代理评估指标至关重要,建议采用实际美元成本而非参数数量等代理指标进行下游评估,因为后者可能因模型开发者选择不当而扭曲成本效益对比。

⚠️ 研究者未能复现LATS和LDB在HumanEval上的报告结果,发现部分代理(如LATS)未使用全部164个测试用例,且LDB论文中声称的GPT-3.5代码生成实际依赖GPT-4,这些问题导致对代理实际表现的误判。

📊 文章通过绘制准确率-成本Pareto曲线揭示,简单策略(如GPT-4的Escalation策略)能在远低成本下达到与复杂代理相当甚至更高的准确率,例如在当前价格下,该策略成本仅为LDB的约一半。

🌐 研究强调,标准化和可复现性对代理评估至关重要,呼吁建立类似HELMS或LM Evaluation Harness的框架,并建议下游开发者在特定领域数据集上测试代理,以获得更贴近实际应用的表现评估。

By Sayash Kapoor, Benedikt Stroebl, Arvind Narayanan

Which is the most accurate AI system for generating code? Surprisingly, there isn’t currently a good way to answer questions like these. 

Based on HumanEval, a widely used benchmark for code generation, the most accurate publicly available system is LDB (short for LLM debugger).1 But there’s a catch. The most accurate generative AI systems, including LDB, tend to be agents,2 which repeatedly invoke language models like GPT-4. That means they can be orders of magnitude more costly to run than the models themselves (which are already pretty costly). If we eke out a 2% accuracy improvement for 100x the cost, is that really better?

In this post, we argue that:

Maximizing accuracy can lead to unbounded cost

LLMs are stochastic. Simply calling a model many times and outputting the most common answer can increase accuracy. 

On some tasks, there is seemingly no limit to the amount of inference compute that can improve accuracy.3 Google Deepmind's AlphaCode, which improved accuracy on automated coding evaluations, showed that this trend holds even when calling LLMs millions of times.

The accuracy of AlphaCode on coding tasks continues to improve even after making a million calls to the underlying model (the different curves represent varying parameter counts). Accuracy is measured by how often one of the top 10 answers generated by the model is correct.

A useful evaluation of agents must therefore ask: What did it cost? If we don’t do cost-controlled comparisons, it will encourage researchers to develop extremely costly agents just to claim they topped the leaderboard.

In fact, when we evaluate agents that have been proposed in the last year for solving coding tasks, we find that visualizing the tradeoff between cost and accuracy yields surprising insights. 

Subscribe to AI Snake Oil receive new posts.

Visualizing the accuracy-cost tradeoff on HumanEval, with new baselines

We re-evaluated the accuracy of three agents that have been claimed to occupy top spots on the HumanEval leaderboard: LDB, LATS, and Reflexion.4 We also evaluated the cost and time requirements of running these agents. 

These agents rely on running the code generated by the model, and if it fails the test cases provided with the problem description, they try to debug the code, look at alternative paths in the code generation process, or "reflect" on why the model's outputs were incorrect before generating another solution. 

In addition, we calculated the accuracy, cost, and running time of  a few simple baselines: 

Surprisingly, we are not aware of any papers that compare their proposed agent architectures with any of the latter three simple baselines.

Our most striking result is that agent architectures for HumanEval do not outperform our simpler baselines despite costing more. In fact, agents differ drastically in terms of cost: for substantially similar accuracy, the cost can differ by almost two orders of magnitude!8 Yet, the cost of running these agents isn't a top-line metric reported in any of these papers.9

Our simple baselines offer Pareto improvements over existing agent architectures. We run each agent five times and report the mean accuracy and the mean total cost on the 164 HumanEval problems. Where results for LDB have two models/agents in parenthesis, they indicate the language model or agent used to generate the code, followed by the language model used to debug the code. Where they have just one, they indicate that the same model was used to both generate the code and debug it. Note that the y-axis is shown from 0.7 to 1; figures with the full axis (0 to 1) and error bars, robustness checks, and other details about our empirical results are included in the appendix

There is no significant accuracy difference between the warming strategy and the best-performing agent architecture. Yet, Reflexion and LDB cost over 50% more than the warming strategy,10 and LATS over 50 times more (all these costs are entirely or predominantly from calls to GPT-4, so these ratios will be stable even if model costs change).  Meanwhile, the escalation strategy strictly improves accuracy while costing less than half of LDB (GPT-3.5) at current inference prices.11 

Our results point to another underlying problem: papers making claims about the usefulness of agents have so far failed to test if simple agent baselines can lead to similar accuracy. This has led to widespread beliefs among AI researchers that complex ideas like planning, reflection, and debugging are responsible for accuracy gains. In fact, Lipton and Steinhardt noted a trend in the AI literature of failing to identify the sources of empirical gains back in 2018.

Based on our findings, the question of whether debugging, reflection, and other such “System 2” approaches are useful for code generation remains open.12 It is possible that they will be useful on harder programming tasks than those represented in HumanEval. For now, the over-optimism about System 2 approaches is exacerbated by a lack of reproducibility and standardization that we report below.13

Proxies for cost are misleading

At first glance, reporting dollar costs is jarring. It breaks many properties of benchmarking that we take for granted: that measurements don’t change over time (whereas costs tend to come down) and that different models compete on a level playing field (whereas some developers may benefit from economies of scale, leading to lower inference costs). Because of this, researchers usually pick a different axis for the Pareto curve, such as parameter count.

The downsides of reporting costs are real, but we describe below how they can be mitigated. More importantly, we think using attributes like parameter count as a proxy for cost is a mistake and doesn’t solve the problem it’s intended to solve. To understand why, we need to introduce a conceptual distinction.

AI evaluations serve at least two distinct purposes. Model developers and AI researchers use them to identify which changes to the training data and architecture improve accuracy. We call this model evaluation. And downstream developers, such as programmers who use AI to build consumer-facing products, use evaluations to decide which AI systems to use in their products. We call this downstream evaluation

The difference between model evaluation and downstream evaluation is underappreciated. This has led to much confusion about how to factor in the cost of running AI.

Model evaluation is a scientific question of interest to researchers. So it makes sense to stay away from dollar costs for the aforementioned reasons. Instead, controlling for compute is a reasonable approach: if we normalize the amount of compute used to train a model, we can then understand if factors like architectural changes or changes in the data composition are responsible for improvements, as opposed to more compute. Notably, Nathan Lambert argues that many of the accuracy gains in the last year (such as Meta's Llama 2) are simply consequences of using more compute. 

On the other hand, downstream evaluation is an engineering question that helps inform a procurement decision. Here, cost is the actual construct of interest. The downsides of cost measurement aren’t downsides at all; they are exactly what’s needed. Inference costs do come down over time, and that greatly matters to downstream developers. It is unnecessary and counterproductive for the evaluation to stay frozen in time.

In this context, proxies for cost (such as the number of active parameters or amount of compute used) are misleading. For example, Mistral released the figure below alongside their latest model, Mixtral 8x22B, to explain why developers should choose it over competitors. 

Substituting active parameters as a proxy for cost is misleading. Source: Mistral.

In this figure, the number of active parameters is a poor proxy for cost. On Anyscale, Mixtral 8x7B costs twice as much as Llama 2 13B, yet Mistral's figure shows it costs about the same, because they only consider the number of active parameters. Of course, downstream developers don't care about the number of active parameters when they're using an API. They simply care about the dollar cost relative to accuracy. Mistral chose “active parameters” as a proxy, presumably because it makes their models look better than dense models such as Meta’s Llama and Cohere’s Command R+. If we start using proxies for cost, every model developer can pick a proxy that makes their model look good.

Some hurdles to cost evaluation remain. Different providers can charge different amounts for the same model, the cost of an API call might change overnight, and cost might vary based on model developer decisions, such as whether bulk API calls are charged differently. These downsides can be partly addressed by making the evaluation results customizable using mechanisms to adjust the cost of running models, i.e., providing users the option to adjust the cost of input and output tokens for their provider of choice to recalculate the tradeoff between cost and accuracy. In turn, downstream evaluations of agents should include input/output token counts in addition to dollar costs, so that anyone looking at the evaluation in the future can instantly recalculate the cost using current prices.

But ultimately, despite the hurdles, good measurement requires modeling the underlying construct of interest. For downstream evaluations, that underlying construct is cost. All other proxies are lacking.

Agent evaluations lack standardization and reproducibility

In the course of our evaluation, we found many shortcomings in the reproducibility and standardization of agent evaluations.

These shortcomings in the empirical results have also led to errors of interpretation in broader discussions around the accuracy of AI agents. For example, a recent post by Andrew Ng claimed that agents that use GPT-3.5 can outperform GPT-4. In particular, he claimed: 

[For HumanEval,] GPT-3.5 (zero shot) was 48.1% correct. GPT-4 (zero shot) does better at 67.0%. However, the improvement from GPT-3.5 to GPT-4 is dwarfed by incorporating an iterative agent workflow. Indeed, wrapped in an agent loop, GPT-3.5 achieves up to 95.1%. 

While this claim received a lot of attention, it is incorrect. The claim ("GPT-3.5 wrapped in an agent workflow achieves 95.1% accuracy") seems to be about the LDB agent. The Papers With Code leaderboard for HumanEval makes the same claim. However, as we discussed above, for LDB, GPT-3.5 is only used to find bugs. The code is generated using GPT-4 (or the Reflexion agent that uses GPT-4), not GPT-3.5. Unfortunately, the error in the paper has led to much overoptimism about agents in the broader AI community.

Ng's post also makes the familiar error of repeating results from papers without verifying them or accounting for changes in prompts and model versions. For example, the zero-shot accuracy numbers of GPT-3.5 (48.1%) and GPT-4 (67.0%) seem to be copied from the GPT-4 technical report from March 2023. However, the models have been updated many times since release. Indeed, in our comparison, we find that the base models perform much better compared to the claimed figures in Ng's post when we use them with the prompts provided with the LDB paper (GPT-3.5: 73.9%, GPT-4: 89.6%). As a result, the post drastically overestimates the improvement attributable to agent architectures.

Evaluation frameworks like Stanford's HELM and EleutherAI's LM Evaluation Harness attempt to fix similar shortcomings for model evaluations, by providing standardized evaluation results. We are working on solutions to make agent evaluations standardized and reproducible, especially from the perspective of downstream evaluation of agents.

Finally, downstream developers should keep in mind that HumanEval or any other standardized benchmark is nothing more than a rough proxy for the specific tasks that arise in a particular downstream application. To understand how agents will perform in practice, it is necessary to evaluate them on a custom dataset from the domain of interest — or even better, A/B test different agents in the production environment.

Further reading

The code for reproducing our analysis is available here. The appendix includes more details about our setup and results.

Acknowledgments

We thank Rishi Bommasani, Rumman Chowdhury, Percy Liang, Shayne Longpre, Yifan Mai, Nitya Nadgir, Matt Salganik, Hailey Schoelkopf, Zachary Siegel, and Venia Veselovsky for discussions and inputs that informed our analysis. We acknowledge Cunxiang Wang and Ruoxi Ning for their prompt responses to our questions about the NovelQA benchmark. 

We are grateful to the authors of the papers we engage with in this post for their quick responses and for sharing their code, which makes such reproduction analysis possible in the first place. In particular, we are grateful to Zilong Wang (LDB), Andy Zhou (LATS), and Karthik Narasimhan (Reflexion), who gave us feedback in response to an earlier draft of this blog post.

1

The leaderboard on the linked page lists AgentCoder as the most accurate system. However, the code or data for reproducing the results of this agent are not available online, so we do not consider it in this blog post.

2

This post is about agents. Leaderboards are also becoming less useful for evaluating the underlying models. There are many problems, including gameability. But controlling for inference cost isn’t the main problem, so our arguments don’t necessarily apply.

3

Tasks where increased compute could help indefinitely are primarily those where verifying whether a solution is correct is easy. In the case of programming questions, this takes the form of test cases that are provided with each question to check if the answer is correct. Other examples include proving theorems, because verifying if a theorem is correct can be straightforward, as well as some tasks on the internet for agents that navigate the web. That said, even for tasks where there is no way to guess a solution and then verify, the costs of different agents can vary by orders of magnitude.

4

We included agents from the HumanEval leaderboard on PapersWithCode that share their code publicly. Reflexion is absent from the PapersWithCode list, but it has a reported accuracy of 91% (higher than any other agents with publicly available code apart from LDB and LATS), so we included it too.

5

For the model evaluation, we only used the description of the coding problem as well as the example tests provided with the HumanEval dataset. Three of the 164 coding problems in HumanEval lack example tests. The authors of LDB include a modified version of HumanEval with example tests included for these three problems. We use this modified version for all experiments.

6

In all of the baselines we provide, we don't use the test cases used to evaluate if the solution is correct when deciding to retry, only the ones in the problem description, to avoid leakage.

7

We evaluated Llama-3 using together.ai endpoints. The cost per million tokens on together.ai, for both prompt and completion, is 0.20$ and 0.90$ for Llama-3-8B and Llama-3-70B, respectively.

8

This is also true for other desired properties of agents, such as running time. We report results for time vs. accuracy tradeoffs in the appendix.

9

While some of the papers introducing these models discuss cost abstractly, such as the relationship between cost and number of times an agent retries, they don't report any concrete numbers on cost or compare token count to a baseline.

10

The cost comparison is for LDB (Reflexion, GPT-3.5), since that is the top-performing agent reported by the authors of LDB. 

11

In addition to HumanEval, we also ran experiments on the HotPotQA and NovelQA benchmarks for question answering. We found similar results for both benchmarks. In particular, we found that there can be large differences in cost underlying small improvements in accuracy for both benchmarks.

12

One potential concern with our analysis is that while we relied on the April 2024 version of OpenAI models, many papers relied on older model versions for their results. To address this, we report results for an additional robustness check with the June 2023 version of OpenAI models in the appendix; we find substantially similar results across model versions.

13

While HumanEval is commonly used to evaluate how well AI can solve coding problems, it is limited due to its small size (only 164 questions), lack of difficult problems (none of the problems involve real-world tasks), and potential contamination, since language models have likely been trained on HumanEval problems, which might inflate the performance of the simple baselines we test. A more rigorous examination of hypotheses related to whether System 2 thinking helps will likely require the use of more comprehensive and robust benchmarks, such as SWE-bench.

14

LDB uses already-existing solutions to improve them by debugging. The existing solutions can come from models like GPT-3.5 or GPT-4, or from agents like Reflexion. Since the authors of Reflexion provided all of the generated solutions in their Github repo, the authors of the LDB paper used code from the original Reflexion repository to run their analysis, rather than rerunning the Reflexion agent. The difference between the reported results and our reproduced results could be due to differences in the code generated by the Reflexion agent. Reusing Reflexion solutions is a reasonable choice for evaluating the usefulness of debugging (indeed, we see LDB increases the accuracy over using the models alone). The problem arises when their final accuracy is interpreted as a downstream evaluation, since it might give developers an inflated estimate of the accuracy of such techniques for coding.

15

The authors acknowledge this and plan to update their results.

16

The authors of LDB only tested the GPT-3.5 model as the debugger, which performed notably worse than the agent using GPT-4 as the debugger, with an accuracy of 88.9% for LDB (GPT-3.5 + Reflexion) vs. 92.9% for LDB (GPT-4 + Reflexion).

17

In correspondence with the authors of LATS, they clarified: "Originally, there was an execution error when evaluating some test cases for [one of the HumanEval test cases], so we opted to remove it from our setting."

18

The authors acknowledge this and plan to update the paper to address it.

19

The authors acknowledge this and plan to update the paper to address it.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI代码生成 代理架构 准确率-成本权衡 HumanEval基准 模型评估标准化 深度学习成本分析 AI基准测试问题
相关文章