TechCrunch News · September 22
A New Training Paradigm for AI Agents: The Rise of Reinforcement Learning Environments

 

Today's AI agent technology remains limited, and the industry is exploring new training methods. Reinforcement learning (RL) environments, training platforms that simulate workspaces, are becoming a key ingredient. Large AI labs are pouring resources into building and procuring RL environments, spawning startups such as Mechanize and Prime Intellect and drawing in established data-labeling companies like Mercor and Surge. RL environments train AI agents on multi-step tasks by simulating real software applications and sending reward signals based on performance. Although complex and error-prone to build, RL environments offer advantages over static datasets, and past projects such as OpenAI's RL Gyms and AlphaGo have demonstrated the approach's potential. The industry is now working toward more general, more capable computer-using AI agents. Despite doubts around reward-function design and the challenge of scaling, RL environments are seen as an important direction for advancing AI.

🚀 **RL environments are becoming key to training AI agents**: To overcome the limits of today's AI agents at autonomously completing tasks, the industry is turning to reinforcement learning (RL) environments. These environments simulate real workspaces, letting AI agents build capability by training on multi-step tasks, like a "virtual playground" where the AI can learn and practice.

💡 **The rise of RL environments and market dynamics**: Major AI labs such as Anthropic, OpenAI, Google, and Meta are investing heavily in RL environment development and seeking out third-party vendors. This has spawned dedicated startups like Mechanize and Prime Intellect and pushed established data-labeling companies like Mercor and Surge to pivot, all racing to meet demand with the goal of becoming the "Scale AI" of AI environments.

🎮 **How RL environments work, and why they are complex**: RL environments train AI agents by simulating a user's interactions with software applications, for example shopping in a browser. The agent's performance is evaluated and rewarded with a signal. Building these environments is extremely complex: they must capture whatever unexpected behavior an agent might exhibit while still delivering useful feedback, far exceeding the difficulty of building static datasets.

📈 **The potential and challenges of RL environments**: RL environments are seen as key to advancing AI, especially now that static-dataset methods are showing diminishing returns. They let AI agents use tools, access the internet, and interact with a variety of software applications, enabling more general capabilities. However, reward-function design, preventing AI "cheating" (reward hacking), and the cost of large-scale deployment remain major challenges.

For years, Big Tech CEOs have touted visions of AI agents that can autonomously use software applications to complete tasks for people. But take today’s consumer AI agents out for a spin, whether it’s OpenAI’s ChatGPT Agent or Perplexity’s Comet, and you’ll quickly realize how limited the technology still is. Making AI agents more robust may take a new set of techniques that the industry is still discovering.

One of those techniques is carefully simulating workspaces where agents can be trained on multi-step tasks — known as reinforcement learning (RL) environments. Similarly to how labeled datasets powered the last wave of AI, RL environments are starting to look like a critical element in the development of agents.

AI researchers, founders, and investors tell TechCrunch that leading AI labs are now demanding more RL environments, and there’s no shortage of startups hoping to supply them.

“All the big AI labs are building RL environments in-house,” said Jennifer Li, general partner at Andreessen Horowitz, in an interview with TechCrunch. “But as you can imagine, creating these datasets is very complex, so AI labs are also looking at third party vendors that can create high quality environments and evaluations. Everyone is looking at this space.”

The push for RL environments has minted a new class of well-funded startups, such as Mechanize and Prime Intellect, that aim to lead the space. Meanwhile, large data-labeling companies like Mercor and Surge say they’re investing more in RL environments to keep pace with the industry’s shifts from static datasets to interactive simulations. The major labs are considering investing heavily too: according to The Information, leaders at Anthropic have discussed spending more than $1 billion on RL environments over the next year.

The hope for investors and founders is that one of these startups emerges as the “Scale AI for environments,” referring to the $29 billion data-labeling powerhouse that powered the chatbot era.

The question is whether RL environments will truly push the frontier of AI progress.


At their core, RL environments are training grounds that simulate what an AI agent would be doing in a real software application. One founder, in a recent interview, described building them as “like creating a very boring video game.”

For example, an environment could simulate a Chrome browser and task an AI agent with purchasing a pair of socks on Amazon. The agent is graded on its performance and sent a reward signal when it succeeds (in this case, buying a worthy pair of socks).
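To make the grading mechanics concrete, here is a minimal, hypothetical Python sketch of such an environment's interface. The task, action names, and reward values are invented for illustration; a production environment would drive a real browser rather than a four-page state machine.

```python
# Toy, hypothetical RL environment for the sock-shopping task described
# above. It models only the reward-signal plumbing: the agent acts, the
# environment updates its state, and a reward arrives on success.

class SockShoppingEnv:
    """Drastically simplified stand-in for a browser-based shopping env."""

    ACTIONS = ["search_socks", "open_product", "add_to_cart", "checkout"]

    def __init__(self, max_steps: int = 10):
        self.max_steps = max_steps
        self.reset()

    def reset(self) -> str:
        """Start a new episode and return the initial observation."""
        self.cart = []
        self.steps = 0
        self.page = "home"
        return self.page

    def step(self, action: str):
        """Apply one action; return (observation, reward, done)."""
        assert action in self.ACTIONS, f"unknown action: {action}"
        self.steps += 1
        reward, done = 0.0, False

        if action == "search_socks" and self.page == "home":
            self.page = "results"
        elif action == "open_product" and self.page == "results":
            self.page = "product"
        elif action == "add_to_cart" and self.page == "product":
            self.cart.append("socks")
        elif action == "checkout":
            # Success only if exactly one pair of socks is in the cart.
            reward = 1.0 if self.cart == ["socks"] else -1.0
            done = True

        if self.steps >= self.max_steps:
            done = True  # the agent got lost; the episode times out
        return self.page, reward, done
```

Note how much of even this toy is failure handling: the overfilled cart, the checkout with nothing in it, the episode that never terminates. Anticipating those wrong turns is where the engineering effort goes.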

While such a task sounds relatively simple, there are a lot of places where an AI agent could get tripped up. It might get lost navigating the web page's drop-down menus, or buy too many socks. And because developers can't predict exactly which wrong turn an agent will take, the environment itself has to be robust enough to capture any unexpected behavior and still deliver useful feedback. That makes building an environment far more complex than building a static dataset.

Some environments are quite elaborate, allowing for AI agents to use tools, access the internet, or use various software applications to complete a given task. Others are more narrow, aimed at helping an agent learn specific tasks in enterprise software applications.

While RL environments are the hot thing in Silicon Valley right now, there's a lot of precedent for the technique. One of OpenAI's first projects back in 2016 was building “RL Gyms,” which were quite similar to the modern conception of environments. The same year, Google DeepMind's AlphaGo AI system beat a world champion at the board game Go, also using RL techniques within a simulated environment.
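The interface OpenAI's Gym work popularized still shapes how environments are written today. As a grounded illustration, the classic (pre-0.26) Gym interaction loop looks like this; newer Gymnasium releases split `done` into `terminated` and `truncated`, but the reset/step/reward pattern is the same.

```python
import gym  # the open-source library that grew out of OpenAI's early Gym work

env = gym.make("CartPole-v1")  # a classic control task bundled with Gym
obs = env.reset()              # begin an episode, get the first observation

done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()          # random policy, for illustration
    obs, reward, done, info = env.step(action)  # classic pre-0.26 Gym API
    total_reward += reward

print(f"episode return: {total_reward}")
env.close()
```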

What’s unique about today’s environments is that researchers are trying to build computer-using AI agents with large transformer models. Unlike AlphaGo, which was a specialized AI system working in a closed environment, today’s AI agents are trained to have more general capabilities. AI researchers today have a stronger starting point, but also a more complicated goal where more can go wrong.

AI data labeling companies like Scale AI, Surge, and Mercor are trying to meet the moment and build out RL environments. These companies have more resources than many startups in the space, as well as deep relationships with AI labs.

Surge CEO Edwin Chen tells TechCrunch he’s recently seen a “significant increase” in demand for RL environments within AI labs. Surge — which reportedly generated $1.2 billion in revenue last year from working with AI labs like OpenAI, Google, Anthropic and Meta — recently spun up a new internal organization specifically tasked with building out RL environments, he said.

Close behind Surge is Mercor, a startup valued at $10 billion, which has also worked with OpenAI, Meta, and Anthropic. Mercor is pitching investors on its business building RL environments for domain specific tasks such as coding, healthcare, and law, according to marketing materials seen by TechCrunch.

Mercor CEO Brendan Foody told TechCrunch in an interview that “few understand how large the opportunity around RL environments truly is.”

Scale AI used to dominate the data labeling space, but has lost ground since Meta invested $14 billion and hired away its CEO. Since then, Google and OpenAI dropped Scale AI as a data provider, and the startup even faces competition for data labeling work inside of Meta. Still, Scale is trying to meet the moment and build environments.

“This is just the nature of the business [Scale AI] is in,” said Chetan Rane, Scale AI’s head of product for agents and RL environments. “Scale has proven its ability to adapt quickly. We did this in the early days of autonomous vehicles, our first business unit. When ChatGPT came out, Scale AI adapted to that. And now, once again, we’re adapting to new frontier spaces like agents and environments.”

Some newer players are focusing exclusively on environments from the outset. Among them is Mechanize, a startup founded roughly six months ago with the audacious goal of “automating all jobs.” However, co-founder Matthew Barnett tells TechCrunch that his firm is starting with RL environments for AI coding agents.

Mechanize aims to supply AI labs with a small number of robust RL environments, Barnett says, in contrast to larger data firms that create a wide range of simpler RL environments. To that end, the startup is offering software engineers $500,000 salaries to build RL environments, far higher than what an hourly contractor could earn working at Scale AI or Surge.

Mechanize has already been working with Anthropic on RL environments, two sources familiar with the matter told TechCrunch. Mechanize and Anthropic declined to comment on the partnership.

Other startups are betting that RL environments will be influential outside of AI labs. Prime Intellect — a startup backed by AI researcher Andrej Karpathy, Founders Fund, and Menlo Ventures — is targeting smaller developers with its RL environments.

Last month, Prime Intellect launched an RL environments hub, which aims to be a “Hugging Face for RL environments.” The idea is to give open-source developers access to the same resources that large AI labs have, and sell those developers access to computational resources in the process.

Training generally capable agents in RL environments can be more computationally expensive than previous AI training techniques, according to Prime Intellect researcher Will Brown. Alongside startups building RL environments, there's another opportunity for GPU providers that can power the process.

“RL environments are going to be too large for any one company to dominate,” said Brown in an interview. “Part of what we’re doing is just trying to build good open-source infrastructure around it. The service we sell is compute, so it is a convenient onramp to using GPUs, but we’re thinking of this more in the long term.”

The open question around RL environments is whether the technique will scale like previous AI training methods.

Reinforcement learning has powered some of the biggest leaps in AI over the past year, including models like OpenAI's o1 and Anthropic's Claude Opus 4. Those are particularly important breakthroughs because the methods previously used to improve AI models are now showing diminishing returns.

Environments are part of AI labs' bigger bet on RL, which many believe will continue to drive progress as labs add more data and computational resources to the process. Some of the OpenAI researchers behind o1 previously told TechCrunch that the company originally invested in AI reasoning models, which were created through investments in RL and test-time compute, because they believed the approach would scale nicely.

The best way to scale RL remains unclear, but environments seem like a promising contender. Instead of simply rewarding chatbots for text responses, they let agents operate in simulations with tools and computers at their disposal. That’s far more resource-intensive, but potentially more rewarding.

Some are skeptical that all these RL environments will pan out. Ross Taylor, a former AI research lead at Meta who co-founded General Reasoning, tells TechCrunch that RL environments are prone to reward hacking. This is a process in which AI models cheat in order to get a reward without actually doing the task.

“I think people are underestimating how difficult it is to scale environments,” said Taylor. “Even the best publicly available [RL environments] typically don’t work without serious modification.”
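A toy illustration of the failure mode, with invented function and field names: if the grader rewards a proxy signal instead of the verified outcome, an agent can learn to trigger the proxy without ever completing the task.

```python
# Hypothetical sketch of reward hacking. The page model and order database
# are invented for illustration; real graders check far richer state.

def naive_reward(page_text: str) -> float:
    # Proxy check: pay out whenever a confirmation string is visible.
    # An agent can hack this by surfacing any page containing the
    # string (say, an old order's receipt) without buying anything.
    return 1.0 if "order confirmed" in page_text.lower() else 0.0

def verified_reward(order_db: dict, episode_id: str) -> float:
    # Outcome check: pay out only if the backend recorded a new order
    # for this episode with the right item and quantity.
    order = order_db.get(episode_id)
    if order and order["item"] == "socks" and order["quantity"] == 1:
        return 1.0
    return 0.0

# The naive grader rewards a lookalike page; the verified one does not.
print(naive_reward("Your previous order confirmed on Jan 3"))  # 1.0 (hacked)
print(verified_reward({}, "ep-42"))                            # 0.0
```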

OpenAI’s Head of Engineering for its API business, Sherwin Wu, said in a recent podcast that he was “short” on RL environment startups. Wu noted that it’s a very competitive space, but also that AI research is evolving so quickly that it’s hard to serve AI labs well.

Karpathy, an investor in Prime Intellect who has called RL environments a potential breakthrough, has also voiced caution about the RL space more broadly. In a post on X, he raised concerns about how much more AI progress can be squeezed out of RL.

“I am bullish on environments and agentic interactions but I am bearish on reinforcement learning specifically,” said Karpathy.

Update: A previous version of this article referred to Mechanize as Mechanize Work. It has been updated to reflect the company’s official name.
