TechCrunch News · September 22
A New Training Paradigm for AI Agents: The Rise of Reinforcement Learning Environments

 

Today's AI agent technology remains limited, and the industry is exploring new training methods. Reinforcement learning (RL) environments, training platforms that simulate workspaces, are becoming a key ingredient. Large AI labs are pouring resources into building and procuring RL environments, spawning startups such as Mechanize and Prime Intellect and drawing in established data-labeling companies like Mercor and Surge. RL environments train AI agents on multi-step tasks by simulating real software applications and sending reward signals based on performance. Although complex and error-prone to build, RL environments offer advantages over static datasets, and past projects such as OpenAI's RL Gyms and AlphaGo have demonstrated the approach's potential. The industry is now working toward more general, more capable computer-using AI agents. Despite doubts around reward-function design and the challenge of scaling, RL environments are seen as an important direction for advancing AI.

🚀 **RL environments are becoming key to training AI agents**: To overcome the limits of today's AI agents at autonomously completing tasks, the industry is turning to reinforcement learning (RL) environments. These environments simulate real workspaces, letting AI agents build capability by training on multi-step tasks, like a "virtual playground" where the AI can learn and practice.

💡 **The rise of RL environments and market dynamics**: Major AI labs such as Anthropic, OpenAI, Google, and Meta are investing heavily in RL environment development and seeking out third-party vendors. This has spawned dedicated startups like Mechanize and Prime Intellect and pushed established data-labeling companies like Mercor and Surge to pivot, all racing to meet demand with the goal of becoming the "Scale AI" of AI environments.

🎮 **How RL environments work, and why they are complex**: RL environments train AI agents by simulating a user's interactions with software applications, for example shopping in a browser. The agent's performance is evaluated and rewarded with a signal. Building these environments is extremely complex: they must capture whatever unexpected behavior an agent might exhibit while still delivering useful feedback, far exceeding the difficulty of building static datasets.

📈 **The potential and challenges of RL environments**: RL environments are seen as key to advancing AI, especially now that static-dataset methods are showing diminishing returns. They let AI agents use tools, access the internet, and interact with a variety of software applications, enabling more general capabilities. However, reward-function design, preventing AI "cheating" (reward hacking), and the cost of large-scale deployment remain major challenges.

For years, Big Tech CEOs have touted visions of AI agents that can autonomously use software applications to complete tasks for people. But take today’s consumer AI agents out for a spin, whether it’s OpenAI’s ChatGPT Agent or Perplexity’s Comet, and you’ll quickly realize how limited the technology still is. Making AI agents more robust may take a new set of techniques that the industry is still discovering.

One of those techniques is carefully simulating workspaces where agents can be trained on multi-step tasks — known as reinforcement learning (RL) environments. Similarly to how labeled datasets powered the last wave of AI, RL environments are starting to look like a critical element in the development of agents.

AI researchers, founders, and investors tell TechCrunch that leading AI labs are now demanding more RL environments, and there’s no shortage of startups hoping to supply them.

“All the big AI labs are building RL environments in-house,” said Jennifer Li, general partner at Andreessen Horowitz, in an interview with TechCrunch. “But as you can imagine, creating these datasets is very complex, so AI labs are also looking at third party vendors that can create high quality environments and evaluations. Everyone is looking at this space.”

The push for RL environments has minted a new class of well-funded startups, such as Mechanize and Prime Intellect, that aim to lead the space. Meanwhile, large data-labeling companies like Mercor and Surge say they’re investing more in RL environments to keep pace with the industry’s shifts from static datasets to interactive simulations. The major labs are considering investing heavily too: according to The Information, leaders at Anthropic have discussed spending more than $1 billion on RL environments over the next year.

The hope for investors and founders is that one of these startups emerges as the “Scale AI for environments,” referring to the $29 billion data-labeling powerhouse that powered the chatbot era.

The question is whether RL environments will truly push the frontier of AI progress.


At their core, RL environments are training grounds that simulate what an AI agent would be doing in a real software application. One founder, in a recent interview, described building them as “like creating a very boring video game.”

For example, an environment could simulate a Chrome browser and task an AI agent with purchasing a pair of socks on Amazon. The agent is graded on its performance and sent a reward signal when it succeeds (in this case, buying a worthy pair of socks).
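To make the grading mechanics concrete, here is a minimal, hypothetical Python sketch of such an environment's interface. The task, action names, and reward values are invented for illustration; a production environment would drive a real browser rather than a four-page state machine.

```python
# Toy, hypothetical RL environment for the sock-shopping task described
# above. It models only the reward-signal plumbing: the agent acts, the
# environment updates its state, and a reward arrives on success.

class SockShoppingEnv:
    """Drastically simplified stand-in for a browser-based shopping env."""

    ACTIONS = ["search_socks", "open_product", "add_to_cart", "checkout"]

    def __init__(self, max_steps: int = 10):
        self.max_steps = max_steps
        self.reset()

    def reset(self) -> str:
        """Start a new episode and return the initial observation."""
        self.cart = []
        self.steps = 0
        self.page = "home"
        return self.page

    def step(self, action: str):
        """Apply one action; return (observation, reward, done)."""
        assert action in self.ACTIONS, f"unknown action: {action}"
        self.steps += 1
        reward, done = 0.0, False

        if action == "search_socks" and self.page == "home":
            self.page = "results"
        elif action == "open_product" and self.page == "results":
            self.page = "product"
        elif action == "add_to_cart" and self.page == "product":
            self.cart.append("socks")
        elif action == "checkout":
            # Success only if exactly one pair of socks is in the cart.
            reward = 1.0 if self.cart == ["socks"] else -1.0
            done = True

        if self.steps >= self.max_steps:
            done = True  # the agent got lost; the episode times out
        return self.page, reward, done
```

Note how much of even this toy is failure handling: the overfilled cart, the checkout with nothing in it, the episode that never terminates. Anticipating those wrong turns is where the engineering effort goes.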

While such a task sounds relatively simple, there are a lot of places where an AI agent could get tripped up. It might get lost navigating the web page's drop-down menus, or buy too many socks. And because developers can't predict exactly which wrong turn an agent will take, the environment itself has to be robust enough to capture any unexpected behavior and still deliver useful feedback. That makes building an environment far more complex than building a static dataset.

Some environments are quite elaborate, allowing for AI agents to use tools, access the internet, or use various software applications to complete a given task. Others are more narrow, aimed at helping an agent learn specific tasks in enterprise software applications.

While RL environments are the hot thing in Silicon Valley right now, there's a lot of precedent for the technique. One of OpenAI's first projects back in 2016 was building “RL Gyms,” which were quite similar to the modern conception of environments. The same year, Google DeepMind's AlphaGo AI system beat a world champion at the board game Go, also using RL techniques within a simulated environment.
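The interface OpenAI's Gym work popularized still shapes how environments are written today. As a grounded illustration, the classic (pre-0.26) Gym interaction loop looks like this; newer Gymnasium releases split `done` into `terminated` and `truncated`, but the reset/step/reward pattern is the same.

```python
import gym  # the open-source library that grew out of OpenAI's early Gym work

env = gym.make("CartPole-v1")  # a classic control task bundled with Gym
obs = env.reset()              # begin an episode, get the first observation

done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()          # random policy, for illustration
    obs, reward, done, info = env.step(action)  # classic pre-0.26 Gym API
    total_reward += reward

print(f"episode return: {total_reward}")
env.close()
```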

What’s unique about today’s environments is that researchers are trying to build computer-using AI agents with large transformer models. Unlike AlphaGo, which was a specialized AI system working in a closed environment, today’s AI agents are trained to have more general capabilities. AI researchers today have a stronger starting point, but also a more complicated goal where more can go wrong.

AI data labeling companies like Scale AI, Surge, and Mercor are trying to meet the moment and build out RL environments. These companies have more resources than many startups in the space, as well as deep relationships with AI labs.

Surge CEO Edwin Chen tells TechCrunch he’s recently seen a “significant increase” in demand for RL environments within AI labs. Surge — which reportedly generated $1.2 billion in revenue last year from working with AI labs like OpenAI, Google, Anthropic and Meta — recently spun up a new internal organization specifically tasked with building out RL environments, he said.

Close behind Surge is Mercor, a startup valued at $10 billion, which has also worked with OpenAI, Meta, and Anthropic. Mercor is pitching investors on its business building RL environments for domain specific tasks such as coding, healthcare, and law, according to marketing materials seen by TechCrunch.

Mercor CEO Brendan Foody told TechCrunch in an interview that “few understand how large the opportunity around RL environments truly is.”

Scale AI used to dominate the data labeling space, but has lost ground since Meta invested $14 billion and hired away its CEO. Since then, Google and OpenAI dropped Scale AI as a data provider, and the startup even faces competition for data labeling work inside of Meta. Still, Scale is trying to meet the moment and build environments.

“This is just the nature of the business [Scale AI] is in,” said Chetan Rane, Scale AI’s head of product for agents and RL environments. “Scale has proven its ability to adapt quickly. We did this in the early days of autonomous vehicles, our first business unit. When ChatGPT came out, Scale AI adapted to that. And now, once again, we’re adapting to new frontier spaces like agents and environments.”

Some newer players are focusing exclusively on environments from the outset. Among them is Mechanize, a startup founded roughly six months ago with the audacious goal of “automating all jobs.” However, co-founder Matthew Barnett tells TechCrunch that his firm is starting with RL environments for AI coding agents.

Mechanize aims to supply AI labs with a small number of robust RL environments, Barnett says, in contrast to larger data firms that create a wide range of simpler RL environments. To that end, the startup is offering software engineers $500,000 salaries to build RL environments, far higher than what an hourly contractor could earn working at Scale AI or Surge.

Mechanize has already been working with Anthropic on RL environments, two sources familiar with the matter told TechCrunch. Mechanize and Anthropic declined to comment on the partnership.

Other startups are betting that RL environments will be influential outside of AI labs. Prime Intellect — a startup backed by AI researcher Andrej Karpathy, Founders Fund, and Menlo Ventures — is targeting smaller developers with its RL environments.

Last month, Prime Intellect launched an RL environments hub, which aims to be a “Hugging Face for RL environments.” The idea is to give open-source developers access to the same resources that large AI labs have, and sell those developers access to computational resources in the process.

Training generally capable agents in RL environments can be more computationally expensive than previous AI training techniques, according to Prime Intellect researcher Will Brown. Alongside startups building RL environments, there's another opportunity for GPU providers that can power the process.

“RL environments are going to be too large for any one company to dominate,” said Brown in an interview. “Part of what we’re doing is just trying to build good open-source infrastructure around it. The service we sell is compute, so it is a convenient onramp to using GPUs, but we’re thinking of this more in the long term.”

The open question around RL environments is whether the technique will scale like previous AI training methods.

Reinforcement learning has powered some of the biggest leaps in AI over the past year, including models like OpenAI's o1 and Anthropic's Claude Opus 4. Those are particularly important breakthroughs because the methods previously used to improve AI models are now showing diminishing returns.

Environments are part of AI labs' bigger bet on RL, which many believe will continue to drive progress as labs add more data and computational resources to the process. Some of the OpenAI researchers behind o1 previously told TechCrunch that the company originally invested in AI reasoning models, which were created through investments in RL and test-time compute, because they believed the approach would scale nicely.

The best way to scale RL remains unclear, but environments seem like a promising contender. Instead of simply rewarding chatbots for text responses, they let agents operate in simulations with tools and computers at their disposal. That’s far more resource-intensive, but potentially more rewarding.

Some are skeptical that all these RL environments will pan out. Ross Taylor, a former AI research lead at Meta who co-founded General Reasoning, tells TechCrunch that RL environments are prone to reward hacking. This is a process in which AI models cheat in order to get a reward without actually doing the task.

“I think people are underestimating how difficult it is to scale environments,” said Taylor. “Even the best publicly available [RL environments] typically don’t work without serious modification.”
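A toy illustration of the failure mode, with invented function and field names: if the grader rewards a proxy signal instead of the verified outcome, an agent can learn to trigger the proxy without ever completing the task.

```python
# Hypothetical sketch of reward hacking. The page model and order database
# are invented for illustration; real graders check far richer state.

def naive_reward(page_text: str) -> float:
    # Proxy check: pay out whenever a confirmation string is visible.
    # An agent can hack this by surfacing any page containing the
    # string (say, an old order's receipt) without buying anything.
    return 1.0 if "order confirmed" in page_text.lower() else 0.0

def verified_reward(order_db: dict, episode_id: str) -> float:
    # Outcome check: pay out only if the backend recorded a new order
    # for this episode with the right item and quantity.
    order = order_db.get(episode_id)
    if order and order["item"] == "socks" and order["quantity"] == 1:
        return 1.0
    return 0.0

# The naive grader rewards a lookalike page; the verified one does not.
print(naive_reward("Your previous order confirmed on Jan 3"))  # 1.0 (hacked)
print(verified_reward({}, "ep-42"))                            # 0.0
```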

OpenAI’s Head of Engineering for its API business, Sherwin Wu, said in a recent podcast that he was “short” on RL environment startups. Wu noted that it’s a very competitive space, but also that AI research is evolving so quickly that it’s hard to serve AI labs well.

Karpathy, an investor in Prime Intellect who has called RL environments a potential breakthrough, has also voiced caution about the RL space more broadly. In a post on X, he raised concerns about how much more AI progress can be squeezed out of RL.

“I am bullish on environments and agentic interactions but I am bearish on reinforcement learning specifically,” said Karpathy.

Update: A previous version of this article referred to Mechanize as Mechanize Work. It has been updated to reflect the company’s official name.
