Machined Learnings · September 12
Improving Contextual Bandits and Their Applications

 

This article discusses an improvement to contextual bandits and its applications. It introduces the IGL setting, which modifies the traditional contextual bandit model by adding a conditional independence assumption, allowing the model to better handle partial feedback. The article also discusses the model's application potential in areas such as music recommendation and brain-computer interaction, and points out directions for future research.

🔍 The IGL setting improves upon the traditional contextual bandit model by introducing a conditional independence assumption, allowing the model to better handle partial feedback.

🎶 In music recommendation, the model can learn from user reactions (likes, adding to a playlist, downloads, and so on) to recommend songs that better match the user's taste.

🧠 In brain-computer interaction tasks, the model could learn to type the correct text from subsequent EEG readings without explicit feedback, which gives it substantial application potential.

🔄 The model learns a policy and a reward decoder by maximizing the difference in decoded reward, achieving strong performance in partial-feedback settings.

🚀 Future research will attempt to relax the conditional independence assumption to broaden the scope of the technique.

Here's an example of a research question that started as a practical concern and ended up having science-fiction levels of potential.

A Practical Question and Answer

Contextual bandits have developed from research prototype to maturing industrial technology. CBs are practical because they incorporate the partial feedback nature of decision making in the real world, while sidestepping difficult issues of credit assignment and planning. In other words: you get a reward for the action you've taken, but you don't know what the reward would have been if you had done something else (partial feedback); but the reward is assumed to only depend upon your current action and not any decisions you have made previously (avoiding credit assignment and planning issues by assumption). Although there are no true contextual bandits in real life (the world is not actually forgetting our previous decisions and resetting itself), “all models are wrong, but some models are useful.”
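To make the partial-feedback point concrete, here is a minimal sketch of a contextual bandit loop. The toy environment, the epsilon-greedy exploration, and the per-action linear reward estimates are all illustrative assumptions, not any particular production system:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features = 3, 5
theta = np.zeros((n_actions, n_features))  # one linear reward estimate per action
counts = np.ones(n_actions)                # crude per-arm step-size bookkeeping

def true_reward(x, a):
    # Hypothetical environment: reward depends only on the current context and action,
    # never on previous decisions (the contextual bandit assumption).
    hidden = np.eye(n_actions, n_features)
    return float(x @ hidden[a] + 0.1 * rng.standard_normal())

for t in range(1000):
    x = rng.standard_normal(n_features)                # observe context
    if rng.random() < 0.1:                             # explore occasionally
        a = int(rng.integers(n_actions))
    else:                                              # otherwise exploit current estimates
        a = int(np.argmax(theta @ x))
    r = true_reward(x, a)                              # reward ONLY for the chosen action;
    counts[a] += 1                                     # counterfactual rewards stay unknown
    theta[a] += (r - theta[a] @ x) * x / counts[a]     # update only the chosen arm's model
```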

Due to my close affiliation with a commercial self-serve contextual bandit platform, I experience some of the practical difficulties. One issue of several: customers do not know how to define rewards. Users have all sorts of reactions to decisions made by our system, and it's not clear how to value them. For example, a recommended song might be liked, added to a specific playlist, or downloaded for offline use. Yet if a contextual bandit chooses to recommend a song, it expects a scalar-valued reward indicating the quality of the decision. A natural question to ask is: can we just learn from the reactions directly? Surprisingly, the answer is yes.

In this paper (excellent work by research intern Tengyang Xie) we introduce the IGL setting, which is a modification of the contextual bandit setting. The generative model is
1. The environment generates a tuple $(x, y, r) \sim D$.
2. The player receives $x$ and chooses an action $a$.
3. The environment reveals the (action-dependent) feedback $y_a$.
4. The reward $r_a$ is never revealed, but the player's goal is low regret with respect to $r$.
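A minimal simulation of this interaction protocol, under made-up distributions for $(x, y, r)$; the Bernoulli rewards, two-dimensional feedback, and placeholder policy below are illustrative assumptions rather than the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions = 3

def sample_environment():
    # Environment draws (x, y, r) jointly; y and r are indexed by action.
    x = rng.standard_normal(4)                        # context
    r = (rng.random(n_actions) < 0.2).astype(float)   # latent per-action reward, never shown
    y = np.stack([rng.multinomial(1, [0.3, 0.7] if ri == 1 else [0.9, 0.1])
                  for ri in r])                       # per-action feedback vector
    return x, y, r

def player_policy(x):
    return int(rng.integers(n_actions))               # placeholder policy

hidden_reward = 0.0
for t in range(100):
    x, y, r = sample_environment()   # 1. environment generates (x, y, r) ~ D
    a = player_policy(x)             # 2. player receives x and chooses a
    feedback = y[a]                  # 3. only the feedback y_a is revealed
    hidden_reward += r[a]            # 4. r_a stays hidden; regret is measured against it
```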

We don't understand all the information-theoretic limits yet. Clearly if $y = 0$ a.s. then the player cannot succeed; we at least need the reward to be decodable from the feedback. But even assuming decodability there are impossibility results, e.g., a bandit setting with two actions and two functions $\{ \psi_1, \psi_2 \}$ such that $\psi_1(y_a) = 1\{a=1\}$ and $\psi_2(y_a) = 1\{a=2\}$. Because of symmetry, given two versions of this scenario where the true reward is either $r = 1\{a=1\}$ or $r = 1\{a=2\}$, we must fail at one of them.

In the paper we identify a sufficient additional condition for success: $y \perp x, a \mid r$. In English this says “the feedback is conditionally independent of context and action given the reward.” This is a strong assumption, but plausibly approximately true in some real problems. In our song recommendation example above, this assumption says the distribution of feedback (likes, saves to a playlist, offline downloads) depends only upon how much the user likes the song and not, e.g., the particular genre of the song, the time of day, or whether the user is on a mobile vs. desktop device.
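In code, the assumption says the feedback distribution is a function of the latent reward alone. A tiny illustration in the song setting, with made-up event probabilities:

```python
import numpy as np

rng = np.random.default_rng(2)

# Feedback probabilities depend ONLY on the latent reward r (did the user like the song?),
# not on genre, time of day, or device -- the conditional independence assumption.
FEEDBACK_GIVEN_R = {
    0: {"none": 0.90, "like": 0.05, "playlist": 0.03, "download": 0.02},
    1: {"none": 0.30, "like": 0.40, "playlist": 0.20, "download": 0.10},
}

def sample_feedback(r):
    events = list(FEEDBACK_GIVEN_R[r])
    probs = list(FEEDBACK_GIVEN_R[r].values())
    return rng.choice(events, p=probs)

# The same distribution applies no matter which context or action produced the reward.
print([sample_feedback(1) for _ in range(5)])
```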

Again, “all models are wrong, but some models are useful”, and the conditional independence assumption allowed us to make progress in the paper. In particular it allowed us to define an objective function, based only upon observable quantities, which nonetheless maximizes unobserved reward. The objective function operates on a tuple $(\pi, \psi)$, where $\pi: X \to A$ is a policy and $\psi: Y \to R$ is a reward decoder, $$\mathcal{L}(\pi, \psi) \doteq \mathbb{E}_{\substack{(x, y) \sim D \\ a \sim \pi}}\left[ \psi(y_a) \right] - \mathbb{E}_{\substack{(x, y) \sim D \\ a \sim \pi_{\text{bad}}}}\left[ \psi(y_a) \right].$$ In English this says “the difference between the decoded reward of the policy and the decoded reward of another 'bad' policy”. $\pi_{\text{bad}}$ is a policy which is known to have bad performance, e.g., the uniform-at-random policy when rewards are rare and there are more than 2 actions. Intuitively, taking differences is necessary to prevent the reward decoder from just saying everything is great. Surprisingly, we prove that maximizing $\mathcal{L}$, corresponding to jointly learning a policy and a reward decoder, also maximizes the reward of the policy $\pi$ in the tuple.
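A rough sketch of what maximizing an empirical version of $\mathcal{L}$ looks like, with $\pi_{\text{bad}}$ taken to be the uniform-at-random policy. Everything here is an illustrative simplification: the simulator exposes the full feedback vector, and the policy and decoder classes are tiny enumerable sets rather than the parameterizations used in the paper:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
K = 3  # number of actions

def simulate(n=2000):
    # Hypothetical logged data. xs[i] is a discrete context, r[i, a] is the hidden reward
    # (the best action equals the context id), and y[i, a] is a binary "like" feedback whose
    # distribution depends only on r[i, a] -- the conditional independence assumption.
    xs = rng.integers(2, size=n)
    r = np.zeros((n, K)); r[np.arange(n), xs] = 1.0
    y = (rng.random((n, K)) < np.where(r == 1, 0.7, 0.1)).astype(float)
    return xs, y, r

xs, y, r = simulate()
n = len(xs)

# Tiny hypothesis classes for the sketch: deterministic policies mapping the two context
# types to actions, and decoders that read the binary feedback as-is or flipped.
policies = list(itertools.product(range(K), repeat=2))        # pi[context] -> action
decoders = {"identity": lambda f: f, "flipped": lambda f: 1.0 - f}

def empirical_objective(pi, psi):
    a_pi = np.array([pi[c] for c in xs])
    decoded_pi = psi(y[np.arange(n), a_pi]).mean()   # decoded reward of the candidate policy
    decoded_bad = psi(y).mean()                      # decoded reward of uniform pi_bad
    return decoded_pi - decoded_bad                  # the difference defeats "everything is great"

pi_hat, psi_name = max(itertools.product(policies, decoders),
                       key=lambda pd: empirical_objective(pd[0], decoders[pd[1]]))
true_value = r[np.arange(n), [pi_hat[c] for c in xs]].mean()
print(f"learned policy {pi_hat} with {psi_name} decoder, true reward {true_value:.2f}")
```

In a real deployment only the feedback $y_a$ for the chosen action is logged, so the first expectation would have to be estimated off-policy; the full-feedback simulator above is just to make the behavior of the objective easy to see.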

For online recommendation scenarios, the conditional independence assumption is viable and we will be testing the above approach on production workflows. However, as follow-on work we are trying to relax the conditional independence assumption to broaden the application scope of this technique.

Future Potential

There are many exciting applications of this concept. For example, consider the brain-computer interaction (BCI) task of typing a message based upon EEG sensor readings. Unfortunately subjects can have extreme medical difficulties that prevent easily eliciting supervised feedback, and even subjects capable of providing supervised feedback are burdened with the overhead of (re)training the system to their patterns. What if the system could learn to type the right thing based upon the subsequent EEG readings without explicit feedback? Even if some supervised pre-training is required, it would be extremely valuable if a system could stay performant as the sensor readings statistically drift over time. More generally, self-tuning would enhance the value of wearables and contribute to the rise of augmented humanity.

In the BCI scenario, both the context and the feedback are brain recordings at different points in time, e.g., synchronously scheduled into alternating time windows of "issue command" and "provide feedback". The conditional independence assumption is implausible here, so we are considering other approaches (e.g., using the above approach but trying to “remove” the dependent parts of the feedback, or developing an alternative objective function which is more robust).
