AIhub · September 13
A new memory mechanism: memory traces in reinforcement learning

This article introduces a new memory framework called "memory traces", aimed at addressing the challenge of partially observable Markov decision processes (POMDPs) in reinforcement learning. The cost of the traditional sliding-window memory grows exponentially as longer sequences must be handled, which makes it impractical in environments such as a T-maze with a long corridor. A memory trace summarizes the history of observations with an exponential moving average and uses a forgetting factor to control how quickly information decays. The work shows that, under suitable conditions, memory traces can exploit this information much more efficiently than sliding windows, significantly lowering the cost of learning and offering a new way to tackle POMDP problems.

🎯 **Memory traces tackle the POMDP challenge**: In reinforcement learning, tasks such as the T-maze are partially observable Markov decision processes (POMDPs), in which the optimal decision depends not only on the current observation but also on past observations. With the traditional sliding-window memory, the number of samples needed for learning grows exponentially with the window length, making long windows computationally prohibitive.

💡 **How memory traces work**: A memory trace is a new kind of memory that summarizes the observation history by computing an exponential moving average over past observations. At its core is a forgetting factor λ, which determines how quickly past information is forgotten. By tuning λ, a memory trace can retain the key historical information while avoiding information overload, so that the agent still has access to the past information it needs when making decisions.

📊 **Strengths and limitations of memory traces**: The study shows that, for a suitable choice of the forgetting factor λ, a memory trace retains the information of the entire observation history, and in theory all past observations could be decoded from a single memory-trace vector. This, however, makes the cost of learning comparable to that of an unboundedly long sliding window. The key is to limit the "resolution" of the learned functions (their Lipschitz constant): with a small λ (fast forgetting), memory traces behave much like sliding windows, whereas with a large λ (slow forgetting) they can significantly outperform sliding windows in certain environments (such as the T-maze), turning the cost of learning from exponential into polynomial.

The T-maze, shown below, is a prototypical example of a task studied in the field of reinforcement learning. An artificial agent enters the maze from the left and immediately receives one of two possible observations: red or green. Red means that the agent will be rewarded for moving to the top at the right end of the corridor (in the question mark tile), while green means the opposite: the agent will be rewarded for moving down. While this seems like a trivial task, modern machine learning algorithms (such as Q-learning) fail at learning the desired behavior. This is because these algorithms are designed to solve Markov Decision Processes (MDPs). In an MDP, optimal agents are reactive: the optimal action depends only on the current observation. However, in the T-maze, the blue question mark tile does not give enough information: the optimal action (going up or down) depends also on the first observation (red or green). Such an environment is called a Partially Observable Markov Decision Process (POMDP).
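
To make the setup concrete, here is a minimal sketch of such a T-maze as a Python environment. The class name, observation strings and reward values are illustrative assumptions, not the environment used in the paper.

```python
import random

class TMaze:
    """Minimal T-maze sketch: the first observation ('red' or 'green')
    determines whether going 'up' or 'down' at the junction is rewarded."""

    def __init__(self, corridor_length=10):
        self.n = corridor_length

    def reset(self):
        self.pos = 0
        self.goal = random.choice(["up", "down"])
        # The initial observation reveals the goal: red -> up, green -> down.
        return "red" if self.goal == "up" else "green"

    def step(self, action):
        if self.pos < self.n:
            # Walking along the corridor; the action is irrelevant here.
            self.pos += 1
            obs = "decision" if self.pos == self.n else "corridor"
            return obs, 0.0, False          # (observation, reward, done)
        # At the junction, only the remembered first observation tells
        # the agent which action is rewarded.
        reward = 1.0 if action == self.goal else -1.0
        return "terminal", reward, True

env = TMaze(corridor_length=5)
print(env.reset())                          # 'red' or 'green'
```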

In a POMDP, it is necessary for an agent to keep a memory of past observations. The most common type of memory is a sliding window of a fixed length $m$. If the complete history of observations up to time $t$ is $h_t = (o_1, o_2, \dots, o_t)$, then the sliding window memory is $w_t = (o_{t-m+1}, \dots, o_t)$, i.e. the last $m$ observations. In the T-maze, since we have to remember the first observation until we reach the blue tile, the length of the window has to be at least equal to the corridor length. The problem with this approach is that learning with long windows is expensive! We can show [1] that learning with windows of length $m$ generally requires a number of samples that scales exponentially in $m$. Thus, learning in the T-maze with the naive sliding window memory is not tractable if the corridor is very long.
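
For concreteness, here is a small sketch of a sliding-window memory and of why it becomes expensive; the function name and padding token are illustrative assumptions.

```python
def sliding_window_memory(history, m):
    """Sliding-window memory: the last m observations, padded with a
    placeholder while the history is still shorter than m."""
    padded = ["<pad>"] * max(0, m - len(history)) + list(history)
    return tuple(padded[-m:])

print(sliding_window_memory(["green", "corridor", "corridor"], m=2))
# ('corridor', 'corridor')

# With |O| possible observations, a window of length m can take |O|**m
# distinct values -- the reason the sample complexity grows exponentially in m.
num_observations, m = 4, 20
print(num_observations ** m)                 # 1_099_511_627_776 distinct windows
```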

Our new work introduces an alternative memory framework: memory traces. The memory trace is an exponential moving average of the history of observations. Formally, $z_t = \lambda z_{t-1} + (1 - \lambda)\, e_{o_t}$, where $e_{o_t}$ is the one-hot vector of the observation $o_t$. The forgetting factor $\lambda$ controls how quickly the past is forgotten. This memory is illustrated in the T-maze above. There are 4 possible observations (colors), and thus memory traces take the form of 4-vectors. In this example, the initial observation is green. As the agent walks along the corridor, this initial observation slowly fades in the memory trace. Once the agent reaches the blue decision state, the information from the first observation is still accessible in the memory trace, making optimal behavior possible.
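
A minimal sketch of this update, assuming the one-hot encoding and the EMA form written above (the observation names and the value of λ are illustrative):

```python
import numpy as np

OBS = {"red": 0, "green": 1, "corridor": 2, "decision": 3}

def update_trace(z, obs, lam):
    """One EMA step: z_t = lam * z_{t-1} + (1 - lam) * e(o_t)."""
    e = np.zeros(len(OBS))
    e[OBS[obs]] = 1.0
    return lam * z + (1.0 - lam) * e

lam = 0.9                                    # slow forgetting
z = np.zeros(len(OBS))
for obs in ["green"] + ["corridor"] * 20 + ["decision"]:
    z = update_trace(z, obs, lam)

# The 'green' coordinate has decayed to (1 - lam) * lam**21 (about 0.01),
# so the first observation is still readable at the decision state.
print(z.round(4))
```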

To understand whether memory traces provide any benefit over sliding windows, it is helpful to visualize the space of memory traces. Consider the case where there are three possible observations, represented by the one-hot vectors $e_1 = (1, 0, 0)$, $e_2 = (0, 1, 0)$, and $e_3 = (0, 0, 1)$. Memory traces are linear combinations of these three vectors, but in this case it turns out that they all lie in a 2-dimensional subspace, so that we can easily visualize them. The picture below shows the set of all possible memory traces for different history lengths with the forgetting factor $\lambda = 1/2$. The set of memory traces forms a recursive Sierpiński triangle.
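
The following sketch enumerates the memory traces of every observation history of a given length for three one-hot observations and λ = 1/2; scatter-plotting two coordinates of these points reproduces the Sierpiński pattern described above. The starting trace (the simplex centre) is an assumption made purely for illustration.

```python
import itertools
import numpy as np

def all_traces(length, lam=0.5, num_obs=3):
    """Memory traces of every possible observation history of the given length."""
    basis = np.eye(num_obs)
    traces = []
    for history in itertools.product(range(num_obs), repeat=length):
        z = np.full(num_obs, 1.0 / num_obs)   # start at the simplex centre
        for o in history:
            z = lam * z + (1.0 - lam) * basis[o]
        traces.append(z)
    return np.array(traces)

pts = all_traces(length=6)     # 3**6 = 729 trace vectors on the 2-simplex
print(pts.shape)               # (729, 3)
# e.g. plt.scatter(pts[:, 0], pts[:, 1]) reveals the Sierpinski pattern.
```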

The picture changes if we vary the forgetting factor $\lambda$, as shown below.

A surprising result is that, for a suitable choice of the forgetting factor $\lambda$, memory traces preserve all information of the complete history of observations! In this case, we could theoretically decode all previous observations from a single memory trace vector. The reason for this property is that we can identify what happened in the past by zooming in on the space of memory traces.
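
To illustrate the "zooming in" idea, here is a sketch of a decoder that recovers the full history from a single trace. It assumes the one-hot EMA form above with a forgetting factor below 1/2, so that the most recent observation always owns the largest coordinate; it is an illustration of the principle, not the authors' construction.

```python
import numpy as np

def encode(history, lam, num_obs):
    """One-hot EMA trace of a history, starting from the zero vector."""
    z = np.zeros(num_obs)
    for o in history:
        e = np.zeros(num_obs)
        e[o] = 1.0
        z = lam * z + (1.0 - lam) * e
    return z

def decode(z, length, lam):
    """Peel observations off the trace, newest first. With lam < 1/2 the
    newest observation always has the largest coordinate, so we identify
    it, remove its contribution and rescale ("zoom in"), then repeat."""
    history = []
    for _ in range(length):
        o = int(np.argmax(z))
        history.append(o)
        e = np.zeros_like(z)
        e[o] = 1.0
        z = (z - (1.0 - lam) * e) / lam
    return history[::-1]

lam = 0.4
history = [2, 0, 1, 1, 0, 2, 2, 1]
z = encode(history, lam, num_obs=3)
print(decode(z, len(history), lam) == history)   # True
```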

As nothing is truly forgotten, memory traces are equivalent to sliding windows of unbounded length. Since learning with long windows is intractable, so is learning with memory traces. To make learning possible, we can restrict the “resolution” of the functions that we learn, so that they cannot zoom in arbitrarily. Mathematically, this “resolution” is given by the Lipschitz constant of a function. Our main results show that, if we bound the Lipschitz constant, then sliding windows are equivalent to memory traces with a small forgetting factor (“fast forgetting”), while memory traces with a forgetting factor close to 1 (“slow forgetting”) can significantly outperform sliding windows in certain environments. In fact, the T-maze is such an environment. While the cost of learning with sliding windows scales exponentially with the corridor length, for memory traces this scaling is only polynomial!
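
As a back-of-the-envelope illustration (based on the EMA form above, not the paper's formal bounds): at the decision tile, the traces following a red start and a green start differ only in their red/green coordinates, by roughly $(1-\lambda)\lambda^{n}$ for corridor length $n$. With fast forgetting this gap vanishes exponentially in $n$, but choosing $\lambda$ close to 1, e.g. $\lambda = 1 - 1/n$, makes it shrink only like $1/n$, so a function with a fixed Lipschitz constant can still tell the two traces apart.

```python
import numpy as np

def trace_gap(lam, n):
    """Distance between the decision-tile traces after a red start and a
    green start: only the red/green coordinates differ, each by about
    (1 - lam) * lam**n for corridor length n."""
    return np.sqrt(2.0) * (1.0 - lam) * lam ** n

for n in [10, 50, 250]:
    fast = trace_gap(lam=0.5, n=n)            # fast forgetting
    slow = trace_gap(lam=1.0 - 1.0 / n, n=n)  # slow forgetting
    print(f"n={n:4d}   fast: {fast:.1e}   slow: {slow:.1e}")
# The fast-forgetting gap vanishes exponentially in n, while the
# slow-forgetting gap shrinks only like 1/n, so a fixed-Lipschitz
# function can still separate the two traces for long corridors.
```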

Reference

[1] Partially Observable Reinforcement Learning with Memory Traces, Onno Eberhard, Michael Muehlebach and Claire Vernade. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, 2025.

Related tags

Reinforcement Learning, Partially Observable Markov Decision Process (POMDP), Memory Mechanism, Memory Traces