Ankit Maloo · September 30
The Value and Trends of AI Environments

AI environments supply observations, actions, rewards, and state transitions for reinforcement learning, and are a key tool for acquiring priors. The current trend is that environments matter most early in a model's generalization, but as priors accumulate their value shifts toward evaluation. Fresh, proprietary feedback loops or predictive reward models (SRMs) can build durable moats, especially for tasks that reflect a shifting reality. Real-time user-interaction environments such as Cursor and Mercor form distinctive moats by continuously updating priors, though Goodhart's law and model overfitting remain risks.

🔍 AI environments provide observations, actions, rewards, and state transitions, generating training data for RL algorithms and helping models generalize quickly. When a model's priors are insufficient, the environment is the core tool for building skill priors; its value declines as priors accumulate.

📈 Predictive reward models (SRMs) address delayed or subjective rewards by modeling human behavior or complex system interactions, for example predicting outcome probabilities in drug discovery or business decisions. Maintaining a successful SRM requires continuous data updates and recalibration, which forms a distinctive moat.

🌐 Real-time user-interaction environments (such as Cursor) continuously collect real user behavior and dynamically update the model's priors. Their value lies in data exclusivity and user scale, but they must cope with constantly changing learning needs and cannot fully replace traditional environments.

🛡️ For preference-driven tasks such as simulated shopping or booking, environments struggle to capture implicit user preferences and dynamic pricing unless they incorporate platform signals and user preference models. The value of such environments depends on access to provider data for building accurate hidden-state models.

⚙️ An environment's value cycle has three phases: cold start (the environment manufactures training data), skill building (SFT/RL learning), and evaluation (the environment becomes a safety and testing tool). Durable moats come from environments that keep supplying fresh priors, such as predictive models that capture human decision patterns.

AI trends move fast. Writing about them makes sense if it's your job. I'm picking this one up because the discussion has escalated recently and some clarity seems needed. This is me trying to make sense of RL environments (and this understanding may evolve).

This was sparked by a few things: some RL environment startups are getting hot, others are using environments as an open-source growth engine, and some have cautioned against building an RL environment startup at all. I will leave you to read the linked posts to understand the basics.

tl;dr: Environments are valuable until priors saturate. Durable moats come from fresh, proprietary feedback loops or predictive reward models that reflect shifting reality.

What is an RL environment?

An environment supplies observations, accepts actions, emits rewards, and transitions state for a given RL setup. Algorithms (PPO, RLHF/DPO variants) optimize behavior within that environment. Evals are environments without learning turned on.
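To make that contract concrete, here is a minimal sketch in Python, loosely following the common Gymnasium-style convention; the class and method names are illustrative, not any specific library's API.

```python
# Minimal sketch of the environment contract described above.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StepResult:
    observation: Any          # what the agent sees next
    reward: float             # scalar feedback for the action just taken
    done: bool                # whether the episode has ended
    info: dict = field(default_factory=dict)


class Environment:
    """Supplies observations, accepts actions, emits rewards, transitions state."""

    def reset(self) -> Any:
        """Start a new episode and return the initial observation."""
        raise NotImplementedError

    def step(self, action: Any) -> StepResult:
        """Apply an action, advance the hidden state, and return the result."""
        raise NotImplementedError

# An eval is the same object with learning turned off: roll the policy out
# against reset()/step(), record rewards, never update weights.
```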

Why does verifiable-reward RL work now?

RL can be understood as a function of three levers: the environment (where the reward comes from), the algorithm (how you propagate the reward), and the experience/training set of the base model (referred to as the 'prior' in both Bayesian and RL literature). Throughout RL research, we have focused heavily on algorithms, coming up with new ways to propagate the reward and always starting from scratch. In his excellent post The Second Half, Shunyu breaks down why priors are the most important of the three, even though we previously had no way of getting them.

Over the last five years, we scaled pretraining to the entire internet and more. Models knew about all topics and could respond intelligently with details. But something was missing: an ability to make sense of all that pretrained knowledge. This is where Chain of Thought (CoT) came in. It lets models (not unlike humans) connect the dots and generalize from what they know. When you give a model compute to think before acting, it can use its knowledge in important ways. As Shunyu says:

language generalizes through reasoning in agents

So the problem of making models better at a given task reduces to providing the right priors for that task[1]. With enough knowledge and the ability to reason, the model will generalize and get better at solving those tasks.

Environments and algorithms are tools to elicit or update these priors[2]. A language model with strong priors and enough test-time compute to reason will be able to finish related tasks. Moreover, we can teach a model new priors by designing the right set of tasks or evals.

Environments can generate training examples

Environments are particularly good at generating training examples quickly (which then become priors). But the catch is that once the model has enough training to generalize, the environment's marginal utility collapses: it goes from being critical to being an eval harness. It remains useful for evaluations, safety/regression tests, and incremental improvements, but it is far less valuable. Take web browsing[3]:

Or you can go through other routes:

Once a model has enough priors, it just needs reasoning ability and interaction with live websites. You don’t need special environments for every new website. They are now only useful for evaluations or safety training.
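Before that saturation point, the "manufacture training examples" role looks roughly like this hypothetical sketch: roll a policy out in the environment, keep only high-reward trajectories, and use them as SFT data. The env and policy objects are assumed to follow the interface sketched above.

```python
# Hypothetical sketch of an environment manufacturing training examples.
def collect_trajectories(env, policy, episodes: int, min_return: float):
    kept = []
    for _ in range(episodes):
        obs, done, total, steps = env.reset(), False, 0.0, []
        while not done:
            action = policy.act(obs)
            result = env.step(action)
            steps.append((obs, action, result.reward))
            total += result.reward
            obs, done = result.observation, result.done
        if total >= min_return:   # crude filter: keep only successful runs
            kept.append(steps)
    return kept                   # becomes SFT data, i.e. new skill priors
```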

When environments create value vs. when they collapse to an eval harness

I picked the web browsing example deliberately because it is a simple, crisp, short-horizon problem once the action prior exists. Multi-turn agentic work is more complex, but the same principle holds, applied differently: you need orders of magnitude more steps, plus long-horizon credit assignment and a final reward. The recipe remains the same:

cold start (no priors) → environment manufactures trajectories → SFT/RL to build skill priors → measure generalization on unseen tasks → environment reduces to eval harness + safety. This pattern recurs in coding (compile/tests as reward), compliance (approval as reward), and sales (response/outcome as reward proxy).
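Here is a rough sketch of that lifecycle as a loop, with every component passed in as a placeholder callable (all names are illustrative): keep training on environment-made data until generalization on unseen tasks plateaus, then demote the environment to an eval and safety harness.

```python
# Sketch of the recipe above; training stops once priors have saturated.
def environment_lifecycle(manufacture, finetune, evaluate, patience: int = 3):
    best, stale = float("-inf"), 0
    while stale < patience:
        data = manufacture()      # environment manufactures trajectories
        finetune(data)            # SFT/RL to build skill priors
        score = evaluate()        # measure generalization on unseen tasks
        if score > best:
            best, stale = score, 0
        else:
            stale += 1            # no improvement: priors are saturating
    return "eval harness"         # the environment's remaining role
```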

So, does this mean all environments eventually become worthless? It depends on whether the priors the AI needs to learn stay constant or keep changing. This brings us to an interesting type of environment:

Where moats can be created

Predictive Reward Environments

The technical term for this is a Surrogate Reward Model (SRM).

When rewards are delayed or subjective, you need a surrogate reward model (SRM) trained on real outcomes. That model is only as good as the breadth, freshness, and exclusivity of the data you feed it; maintaining it is the key.

In cases like creating an effective sales pitch, knowing if a strategy document is good enough, or getting a compliance report approved, rewards are subjective and depend on multiple factors. You need environments that can model either human behavior or complex system interactions.

Here's where predictive reward models come in. In drug discovery, we have models that can look at a protein structure, predict its binding probability, and assign that as the reward instead of testing the structure in a wet lab. It's a cheap method that is instant, scalable, and able to stand in for delayed outcomes.

In business contexts, we need models that can predict the probability of a generated compliance report getting approved, the likelihood of a business committee preferring one strategy report over another, and so on.
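As a hedged sketch of what such a surrogate reward model could look like: a small classifier fit on historical (artifact, outcome) pairs and used to emit an instant reward where the true outcome (e.g. committee approval) is delayed. It uses scikit-learn's logistic regression as a stand-in for whatever model you would actually use; the features and labels are placeholders.

```python
# Hedged sketch of a surrogate reward model (SRM).
from sklearn.linear_model import LogisticRegression


class SurrogateRewardModel:
    def __init__(self):
        self.clf = LogisticRegression(max_iter=1000)

    def fit(self, features, outcomes):
        """features: e.g. embeddings of past reports; outcomes: 1=approved, 0=rejected."""
        self.clf.fit(features, outcomes)

    def reward(self, features):
        """Predicted approval probability, used as a dense, instant RL reward."""
        return self.clf.predict_proba(features)[:, 1]
```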

However, LLMs trained against these are prone to Goodhart's law and reward hacking. Models frequently overfit to the proxy rather than the real objective. Moreover, SRMs fail quietly without recalibration and drift checks.
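One illustrative way to catch that quiet failure (an assumption, not something the post prescribes) is a simple drift check: compare the SRM's predicted probabilities against a fresh batch of real outcomes and flag recalibration when the gap grows.

```python
# Illustrative drift check for an SRM against fresh real outcomes.
import numpy as np


def needs_recalibration(predicted_probs, real_outcomes, tolerance: float = 0.05):
    gap = abs(float(np.mean(predicted_probs)) - float(np.mean(real_outcomes)))
    return gap > tolerance
```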

Once you manage all that, SRMs look more like individual products than datasets. An SRM plus a live data flywheel is a defensible moat.

Cursor, Mercor, and Real World “Environments”

While I was writing this, Cursor announced online RL that trains on live user interactions and rolls out a new policy every two hours. Then Mercor's CEO wrote a post on X about a similar idea. For the purposes of this post, treating the real world as the environment is a completely different ballgame. The failure modes are different, the moats are more about user scale, sampling, and data exclusivity, and the goal is to capture constantly changing priors. The product becomes the environment, continuously updating priors with real interactions.
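Purely as an illustration (not Cursor's or Mercor's actual system), such a loop might look like this: live interactions are logged, turned into reward signal, and the policy is refreshed on a fixed cadence. All objects here are placeholders.

```python
# Illustrative online-RL loop where the product itself is the environment.
import time


def online_rl_loop(log_store, trainer, deploy, cadence_seconds: int = 2 * 60 * 60):
    while True:
        batch = log_store.drain()           # interactions since the last update
        if batch:
            policy = trainer.update(batch)  # priors keep shifting with real usage
            deploy(policy)                  # roll out the refreshed policy
        time.sleep(cadence_seconds)
```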

The shifting-priors scenario, though, means that Mercor's claim that "Teaching an AI once is a fixed cost that eliminates the corresponding variable human cost forever" is not going to be widely applicable. You have to keep teaching in many such scenarios.

Beyond SRMs and real-world environments, there's another category where environments struggle: preference-driven tasks with hidden states.

Environments for simulating buying on Amazon or booking a flight/hotel

I built a very early AI travel chatbot in 2018. The biggest learning was that different users have different implicit preferences, and merely finding the optimal flight or hotel is not worth much. Users want any AI to read their mind. Incorporating their preferences nearly doubled our conversions[5].

Travel booking and Amazon buying are both a composite of two things:

I'll posit that, given how much value Amazon's recommendation engine already provides, these environments, or even the buying workflow, would not find many real-world takers (unless Amazon offers an environment itself), because the outcome varies so much for every user. Decision making as a prior is hard to simulate without data from the providers. With flights and hotels, there is the added complexity of dynamic pricing and of modeling hidden states (inventory, etc.). Building these models from scratch is hard, but there is huge alpha for anyone who can.
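To illustrate why this is hard, here is a toy, entirely hypothetical reward that mixes task completion with implicit per-user preferences; a third-party environment usually cannot observe either the preferences or the provider-side state (prices, inventory) behind them.

```python
# Toy reward for a booking task shaped by implicit user preferences.
from dataclasses import dataclass


@dataclass
class UserPreferences:
    max_price: float
    prefers_nonstop: bool


def booking_reward(option: dict, prefs: UserPreferences) -> float:
    score = 1.0                                     # a flight was booked at all
    if option["price"] > prefs.max_price:
        score -= 0.5                                # violates an implicit budget
    if prefs.prefers_nonstop and option["stops"] > 0:
        score -= 0.3                                # ignores a latent preference
    return score
```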

Conclusion

If you are thinking of building an RL environment startup because everyone is doing it, or because models need RL, know the game you are playing. It is useful to think in terms of priors and how they change.

Build for shifting priors with fresh, verifiable feedback. Everything else decays into an eval harness with a shinier marketing term.


PS: I am not building an RL environment, but I am working on some cool ideas in long-horizon RL. Please reach out if you would like to know more.

  1. With thinking / test-time compute, these priors generalize in a given environment. I see the end goal as getting to a model that can do tasks, with or without RL, so priors are the key; thinking or generalization is an action. Priors can be about knowledge, skill, preference, or even context (retrieval). That is a separate post, though. 

  2. With backpropagation and verifiable rewards in the right environment, the model learns the specifics of how to solve a given task and updates its priors. 

  3. From the same Shunyu blog post: "Language pre-training created good priors for chatting, but not equally good for controlling computers or playing video games. Why? These domains are further from the distribution of Internet text, and naively doing SFT / RL on these domains generalizes poorly." So you need to add more training data. 

  4. For a task to be done well, the model needs to learn and narrow down which trajectory of generalization helps. 

  5. I won't go into much detail here on how. Long story short, we sent a 25-question survey to every new user, got 500 responses, interviewed them further, and built an engine to incorporate those preferences. It was more nuanced than it sounds, and it was directionally right. Perhaps I will cover it in another post. 
