The Maximum Entropy Principle: Updating a Probability Distribution When Expected Values Are Known

This article explores how to update a probability distribution when the expected values of certain variables are known. When Bayes’ Rule cannot be applied directly, the principle of maximum entropy offers a principled approach: by maximizing the entropy of the distribution subject to the expected-value constraints, one obtains a distribution which makes the fewest assumptions beyond the known information. The method generalizes to non-traditional forms of evidence, and in particular cases produces the same result as a Bayesian update. The article also introduces the concept of relative entropy, to handle non-uniform priors more flexibly.

💡 The core of the maximum entropy principle: among all distributions satisfying the known information (such as expected-value constraints), choose the one with the highest entropy. This amounts to making the fewest assumptions beyond what is known, i.e. assuming we know as little as possible about the variables.

🔄 When evidence takes the form not of a direct logical statement about the variables (such as “at least 40 green orders”) but of information about expected values (such as “an average of 25 red orders per day”), a traditional Bayesian update can be hard to apply directly. The maximum entropy principle offers a way to handle this kind of “non-standard” evidence.

⚖️ The maxent update can be seen as a generalization of Bayesian updating. When the evidence can be expressed as some indicator function having expected value 1 (for example, an indicator for “Ng ≥ 40”), the maxent method yields the same update as Bayes’ Rule, at least when the prior is uniform.

📈 To handle non-uniform priors, relative entropy (the negative of KL divergence) can be optimized in place of entropy. Maximizing relative entropy means making the fewest assumptions beyond the prior and the known information, which lets the maximum entropy principle mesh with Bayesian updating more broadly.

Published on November 4, 2025 12:02 AM GMT

Jaynes’ Widget Problem[1]: How Do We Update On An Expected Value?

Mr A manages a widget factory. The factory produces widgets of three colors - red, yellow, green - and part of Mr A’s job is to decide how many widgets to paint each color. He wants to match today’s color mix to the mix of orders the factory will receive today, so he needs to make predictions about how many of today’s orders will be for red vs yellow vs green widgets.

The factory will receive some unknown number of orders for each color throughout the day - \(N_r\) red, \(N_y\) yellow, and \(N_g\) green orders. For simplicity, we will assume that Mr A starts out with a prior distribution \(P[N_r,N_y,N_g]\) under which:

- \(N_r\), \(N_y\), and \(N_g\) are independent, and
- each is uniform over the integers 0 through 99

… and then Mr A starts to update that prior on evidence.

You’re familiar with Bayes’ Rule, so you already know how to update on some kinds of evidence. For instance, if Mr A gets a call from the sales department saying “We have at least 40 orders for green widgets today!”, you know how to plug that into Bayes’ Rule:

\[P[N_r,N_y,N_g \mid N_g \ge 40] = \frac{1}{P[N_g \ge 40]}\, P[N_g \ge 40 \mid N_r,N_y,N_g]\, P[N_r,N_y,N_g]\]

\[= \frac{1}{0.6}\, \mathbb{I}[N_g \ge 40]\, P[N_r]\, P[N_y]\, P[N_g]\]

… i.e. the posterior is still uniform, but with probability mass only on \(N_g \ge 40\), and the normalization is different to reflect the narrower distribution.
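As a quick sanity check, here’s a minimal numerical sketch of that update in Python, assuming the uniform-on-0-to-99 prior above (the variable names are ours, not the author’s):

```python
import numpy as np

# Only the green marginal is affected by this evidence; red and yellow stay uniform.
Ng = np.arange(100)
prior = np.full(100, 1 / 100)  # uniform prior over Ng

evidence = Ng >= 40
posterior = prior * evidence / (prior * evidence).sum()

print((prior * evidence).sum())  # P[Ng >= 40] = 0.6
print(posterior[:40].sum())      # 0.0 -- no mass below 40
print(posterior[40])             # 1/60 on each Ng >= 40: still uniform, renormalized
```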

But consider a different kind of evidence: Mr A goes through some past data, and concludes that the average number of red sales each day is 25, the average number of yellow sales is 50, and the average number of green sales is 5. So, Mr A would like to update on the information \(E[N_r]=25\), \(E[N_y]=50\), \(E[N_g]=5\).

Chew on that for a moment.

That’s… not a standard Bayes’ Rule-style update situation. The information doesn’t even have the right type for Bayes’ Rule. It’s not a logical sentence about the variables \((N_r,N_y,N_g)\), it’s a logical statement about the distribution itself. It’s a claim about the expected values which will live in Mr A’s mind, not the widget orders which will live out in the world. It’s evidence which didn’t come from observing \((N_r,N_y,N_g)\), but rather from observing some other stuff and then propagating information through Mr A’s head.

… but at the same time, it seems like a kind of intuitively reasonable type of update to want to make. And we’re Bayesians, we don’t want to update in some ad-hoc way which won’t robustly generalize, so… is there some principled, robustly generalizable way to handle this type of update? If the information doesn’t have the right type signature for Bayes’ Rule, how do we update on it?

Enter Maxent

Here’s a handwavy argument: we started with a uniform prior because we wanted to assume as little as possible about the order counts, in some sense. Likewise, when we update on those expected values, we should assume as little as possible about the order counts while still satisfying those expected values.

Now for the big claim: in order to “assume as little as possible” about a random variable, we should use the distribution with highest entropy.

Conceptually: the entropy \(H(N_r,N_y,N_g)\) tells us how many bits of information we expect to gain by observing the order counts. The less information we expect to gain by observing those counts, the more we must think we already know. A 50/50 coinflip has one bit of entropy; we learn one bit by observing it. A coinflip which we expect will come up heads with 100% chance has zero bits of entropy; we learn zero bits by observing it, because (we think) we already know the one bit which the coin flip nominally tells us. One less bit of expected information gain is one more bit which we implicitly think we already know. Conversely, one less bit which we think we already know means one more bit of entropy.
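To make the bit-counting concrete, here’s a tiny sketch computing the entropy \(H = -\sum_i p_i \log_2 p_i\) of those two coins:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -(p[nz] * np.log2(p[nz])).sum()

print(entropy_bits([0.5, 0.5]))  # 1.0 bit: we learn one bit by observing the flip
print(entropy_bits([1.0, 0.0]))  # 0.0 bits: we (think we) already know the outcome
```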

So, to assume as little as possible about what we already know… we should maximize our distribution’s entropy. We’ll maximize that entropy subject to constraints encoding the things we do want to assume we know - in this case, the expected values.

Spelled out in glorious mathematical detail, our update looks like this:

\[P[N_r,N_y,N_g \mid E[N_r]=25, E[N_y]=50, E[N_g]=5] =\]

\[\operatorname*{argmax}_Q \; -\sum_{N_r,N_y,N_g} Q[N_r,N_y,N_g] \log Q[N_r,N_y,N_g]\]

\[\text{subject to } \sum_{N_r} Q[N_r]\, N_r = 25, \quad \sum_{N_y} Q[N_y]\, N_y = 50, \quad \sum_{N_g} Q[N_g]\, N_g = 5\]

(... as well as the implicit constraints \(Q[N_r,N_y,N_g] \ge 0\) and \(\sum_{N_r,N_y,N_g} Q[N_r,N_y,N_g] = 1\), which make sure that \(Q\) is a probability distribution. We usually won’t write those out, but one does need to include them when actually calculating \(Q\).)

Then we use the Standard Magic Formula for maxent distributions which we’re not going to derive here because this is a concepts post, which says

\[P[N_r,N_y,N_g \mid E[N_r]=25, E[N_y]=50, E[N_g]=5] = \frac{1}{Z} e^{\lambda_r N_r + \lambda_y N_y + \lambda_g N_g}\]

… where the parameters \(\lambda_r, \lambda_y, \lambda_g\) and \(Z\) are chosen to match the expected value constraints and make the distribution sum to 1. (In this case, David's numerical check finds \(Z \approx 17465.2\), \(\lambda_r \approx -0.0349\), \(\lambda_y \approx 0.0006\), \(\lambda_g \approx -0.1823\).)
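For the curious, here’s one way to reproduce that numerical check - a sketch, not the author’s code, assuming the uniform-on-0-to-99 support from the prior. The maxent distribution factorizes across colors, so each constraint only involves one marginal:

```python
import numpy as np
from scipy.optimize import fsolve

N = np.arange(100)                     # assumed support of each order count
targets = np.array([25.0, 50.0, 5.0])  # E[Nr], E[Ny], E[Ng]

def means(lam):
    # Each marginal of the maxent distribution is proportional to exp(lambda * N).
    w = np.exp(np.outer(lam, N))       # shape (3, 100): one row per color
    return (w * N).sum(axis=1) / w.sum(axis=1)

lam = fsolve(lambda l: means(l) - targets, x0=np.zeros(3))
Z = np.exp(np.outer(lam, N)).sum(axis=1).prod()  # normalizer of the joint

print(lam)  # ≈ [-0.0349, 0.0006, -0.1823]
print(Z)    # ≈ 17465.2
```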

Some Special Cases To Check Our Intuition

We have a somewhat-handwavy story for why it makes sense to use this maxent machinery: the more information we expect to gain by observing a variable, the less we implicitly assume we already know about it. So, maximize expected information gain (i.e. minimize implicitly-assumed knowledge) subject to the constraints of whatever information we do think we know.

But to build confidence in that intuitive story, we should check that it does sane things in cases we already understand.

“No Information”

First, what does the maxent construction do when we don’t pass in any constraints? I.e. we don’t think we know anything relevant?

Well, it just gives the distribution with largest entropy over the outcomes, which turns out to be a uniform distribution. So in the case of our widgets problem, the maximum entropy construction with no constraints gives the same prior we specified up front, uniform over all outcomes.

Furthermore: what if the expected number of yellow orders, \(E[N_y]\), were 49.5 - the same as under the prior - and we only use that constraint? Conceptually, that constraint by itself would not add any information not already implied by the prior. And indeed, the maxent distribution would be the same as in the trivial case: uniform.
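We can see this numerically with the same kind of solver as before (again a sketch, under the same assumed 0-to-99 support):

```python
import numpy as np
from scipy.optimize import brentq

N = np.arange(100)

def mean(lam):
    w = np.exp(lam * N)  # maxent marginal is proportional to exp(lambda * N)
    return (w * N).sum() / w.sum()

# Constraining E[Ny] to 49.5, the prior's own mean, gives a multiplier of
# (numerically) zero, so exp(lam * N) is flat and the distribution stays uniform.
lam_y = brentq(lambda l: mean(l) - 49.5, -1.0, 1.0)
print(lam_y)  # ≈ 0.0
```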

Bayes Updates

Now for a more interesting class of special cases. Suppose, as earlier, that Mr A gets a call from the sales department saying “We have at least 40 orders for green widgets today!” - i.e. \(N_g \ge 40\). This is a case where Mr A can use Bayes’ Rule, as we all know and love. But he could use a maxent update instead… and if he does so, he’ll get the same answer as Bayes’ Rule.

Here’s how.

Let’s think about the variable \(\mathbb{I}[N_g \ge 40]\)[2] - i.e. it’s 1 if there are 40 or more green orders, 0 otherwise. What does it mean if I claim \(E[\mathbb{I}[N_g \ge 40]] = 1\)? Well, that expectation is 1 if and only if all of the probability mass is on \(N_g \ge 40\). In other words, \(E[\mathbb{I}[N_g \ge 40]] = 1\) is synonymous with \(N_g \ge 40\) (under the distribution).

So what happens when we find the maxent distribution subject to \(E[\mathbb{I}[N_g \ge 40]] = 1\)? Well, the Standard Magic Formula says

\[P[N_r,N_y,N_g \mid E[\mathbb{I}[N_g \ge 40]]=1] = \frac{1}{Z} e^{\lambda\, \mathbb{I}[N_g \ge 40]}\]

… where \(Z\) and \(\lambda\) are chosen to satisfy the constraints. In this case, we’ll need to take \(\lambda\) to be (positive) infinitely large, and \(Z\) to normalize it. In that limit, the probability will be 0 on \(N_g < 40\), and uniform on \(N_g \ge 40\) - exactly the same as the Bayes update.
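We can watch that limit happen numerically (a sketch; only the \(N_g\) marginal matters here):

```python
import numpy as np

Ng = np.arange(100)
indicator = (Ng >= 40).astype(float)

for lam in [1.0, 5.0, 20.0]:
    w = np.exp(lam * indicator)  # maxent weights for the constraint E[I[Ng>=40]] = 1
    p = w / w.sum()
    # As lam grows: mass below 40 -> 0, mass on each Ng >= 40 -> 1/60 ≈ 0.0167.
    print(lam, p[:40].sum(), p[40])
```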

This generalizes: the same construction, with the expectation of an indicator function, can always be used in the maxent framework to get the same answer as a Bayes update on a uniform distribution.

… but uniform distributions aren’t always the right starting point, which brings us to the next key piece.

Relative Entropy and Priors

Our trick above to replicate a Bayes update using maximum entropy machinery only works insofar as the prior is uniform. And that points to a more general problem with this whole maxent approach: intuitively, it doesn’t seem like a uniform prior should always be my “assume as little as possible” starting point.

A toy example of the sort of problem which comes up: suppose two people are studying rolls of the same standard six-sided die. One of them studies extreme outcomes, and only cares whether the die rolls 6 or not, so as a preprocessing step they bin all the rolls into 6 or not-6. The other keeps the raw data on the rolls. Now, if they both use a uniform distribution, they get different distributions: one of them assigns probability ½ to a roll of 6 (because 6 is one of the two preprocessed outcomes), the other assigns probability ⅙ to a roll of 6. Seems wrong! This maxent machine should have some kind of slot in it where we put in a distribution representing (in this case) how many things we binned together already. Or, more generally, a slot where we put in prior information which we want to take as already known/given, aside from the expectation constraints.

Enter relative entropy, the negative of KL divergence.

Relative entropy can be thought of as entropy measured relative to a reference distribution, which works like a prior. Intuitively: maximizing entropy relative to a reference means staying as close to that reference as possible - assuming as little as possible beyond it - while still satisfying whatever constraints we impose.

In most cases, rather than maximizing entropy, it makes more sense to maximize relative entropy - i.e. minimize KL divergence - relative to some prior Q. (In the case of continuous variables, using relative entropy rather than entropy is an absolute necessity, for reasons we won’t get into here.)
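To see how the reference-distribution slot fixes the die example: with no constraints at all, maximizing entropy relative to a reference \(Q\) just returns \(Q\) itself, so the analyst who binned outcomes can record how much they binned into each bin (a sketch with made-up names):

```python
# Uniform over the preprocessed bins {six, not-six} vs uniform over raw faces:
p_binned = {"six": 1 / 2, "not_six": 1 / 2}
p_raw = {face: 1 / 6 for face in range(1, 7)}
print(p_binned["six"], p_raw[6])  # 0.5 vs 0.1667 -- same die, two different "uniforms"

# A reference distribution over the bins, recording that "not_six" aggregates five
# raw faces, restores agreement: maxent relative to it, with no constraints,
# just returns the reference itself.
reference = {"six": 1 / 6, "not_six": 5 / 6}
print(reference["six"])  # 0.1667, matching the raw-data analysis
```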

The upshot: if we try to mimic a Bayes update in the maxent framework just like we did earlier, but we maximize entropy relative to a prior, we get the same result as a Bayes update - without needing to assume a uniform prior. Mathematically: let

\[P[N_r,N_y,N_g \mid E[\mathbb{I}[N_g \ge 40]]=1] =\]

\[\operatorname*{argmin}_R \; D_{KL}\big(R[N_r,N_y,N_g] \,\big\|\, P[N_r,N_y,N_g]\big)\]

\[\text{subject to } E_R[\mathbb{I}[N_g \ge 40]] = 1.\]

That optimization problem will spit out the standard Bayes-updated distribution

\[P[N_r,N_y,N_g \mid E[\mathbb{I}[N_g \ge 40]]=1] = P[N_r,N_y,N_g \mid N_g \ge 40].\]

… and that is the last big piece in how we think of maxent machinery as a generalization of Bayes updates.
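Here’s a small numerical check of that claim, using an arbitrary non-uniform prior over \(N_g\) (a sketch; the conditioned prior is compared against one other candidate with the right support):

```python
import numpy as np

rng = np.random.default_rng(0)
prior = rng.random(100)
prior /= prior.sum()          # an arbitrary non-uniform prior over Ng = 0..99
mask = np.arange(100) >= 40   # E_R[I[Ng >= 40]] = 1 forces all mass onto Ng >= 40

def kl(r, p):
    nz = r > 0
    return (r[nz] * np.log(r[nz] / p[nz])).sum()

bayes = prior * mask / prior[mask].sum()  # the Bayes-updated (conditioned) prior
other = mask / mask.sum()                 # another distribution supported on Ng >= 40

print(kl(bayes, prior))  # = -log P[Ng >= 40], the minimum achievable
print(kl(other, prior))  # strictly larger: conditioning wins the optimization
```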

Recap

The key pieces to remember are:

- To “assume as little as possible” beyond some known expected values, use the distribution which maximizes entropy subject to constraints encoding those expected values.
- To take a non-uniform prior as given, maximize entropy relative to that prior - i.e. minimize KL divergence to it - instead.
- This machinery can handle evidence, like expected values, which doesn’t have the right type signature for Bayes’ Rule…

… but in the cases which can be handled by Bayes’ Rule, updating via maxent yields the same answer.

  1. ^

    You can find Jaynes’ original problem starting on page 440 of Probability Theory: The Logic Of Science. The version I present here is similar but not identical; I have modified it to remove conceptual distractions about unnormalizable priors and to get to the point of this post faster.

  2. ^

    \(\mathbb{I}[\cdot]\) is the indicator function; it’s 1 if its input is true and 0 if its input is false.


