Relation between Maximum Likelihood and KL-Divergence
Maximizing likelihood is equivalent to minimizing KL-divergence. Suppose $P(\cdot \mid \theta^*)$ is the true distribution and $P(\cdot \mid \theta)$ is our estimate. The KL-divergence between them is written as:
$$
D_{KL}\left[P(x \mid \theta^*) \,\|\, P(x \mid \theta)\right]
= \mathbb{E}_{x \sim P(x \mid \theta^*)}\left[\log \frac{P(x \mid \theta^*)}{P(x \mid \theta)}\right]
= \mathbb{E}_{x \sim P(x \mid \theta^*)}\left[\log P(x \mid \theta^*) - \log P(x \mid \theta)\right]
= \mathbb{E}_{x \sim P(x \mid \theta^*)}\left[\log P(x \mid \theta^*)\right] - \mathbb{E}_{x \sim P(x \mid \theta^*)}\left[\log P(x \mid \theta)\right]
$$

On the right-hand side, the first term is the negative entropy of the true distribution. It does not depend on the estimated parameter $\theta$, so we can ignore it.
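A quick numerical sanity check of this decomposition: the sketch below, using two small categorical distributions as stand-ins for $P(x \mid \theta^*)$ and $P(x \mid \theta)$ (the names `p_true` and `p_model` and the specific probabilities are illustrative, not from the derivation), confirms that the direct definition of the KL-divergence equals the negative entropy term minus the cross term.

```python
import numpy as np

# Illustrative categorical distributions standing in for P(x | theta*) and P(x | theta).
p_true = np.array([0.5, 0.3, 0.2])   # P(x | theta*)
p_model = np.array([0.4, 0.4, 0.2])  # P(x | theta)

# Direct definition: E_{x ~ P(x|theta*)}[log P(x|theta*) - log P(x|theta)]
kl_direct = np.sum(p_true * (np.log(p_true) - np.log(p_model)))

# Decomposition: first term (negative entropy of the true distribution)
# minus second term (expected log-probability under our estimate).
neg_entropy = np.sum(p_true * np.log(p_true))
cross_term = np.sum(p_true * np.log(p_model))
kl_decomposed = neg_entropy - cross_term

print(kl_direct, kl_decomposed)  # the two values agree
```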
Suppose we have $n$ observations drawn from the distribution $x \sim P(x \mid \theta^*)$. Then the Law of Large Numbers says that, as $n$ goes to infinity,
$$
-\frac{1}{n}\sum_{i=1}^{n}\log P(x_i \mid \theta) = -\mathbb{E}_{x \sim P(x \mid \theta^*)}\left[\log P(x \mid \theta)\right]
$$

which is the second term of the KL-divergence expression above. Notice that $-\sum_{i=1}^{n}\log P(x_i \mid \theta)$ is the negative log-likelihood of the data under our estimate. Therefore, minimizing $D_{KL}\left[P(x \mid \theta^*) \,\|\, P(x \mid \theta)\right]$ is equivalent to minimizing the negative log-likelihood, in other words, to maximizing the log-likelihood. This is important because it gives MLE a nice interpretation: maximizing the likelihood of the data under our estimate is the same as minimizing the difference between our estimate and the real data distribution. We can see MLE as a proxy for fitting our estimate to the real distribution, which cannot be done directly because the real distribution is unknown to us.
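The following is a minimal sketch of this connection, assuming a concrete example family of unit-variance Gaussians so that $\theta$ is just the mean; the variable names (`true_mu`, `candidates`, `avg_neg_log_likelihood`) and the grid-search setup are illustrative choices, not part of the argument itself. Samples are drawn from $P(x \mid \theta^*)$, and the candidate mean that minimizes the average negative log-likelihood lands close to the true mean, which is exactly the minimizer of the KL-divergence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example family: unit-variance Gaussians, so theta is just the mean.
# true_mu plays the role of theta*.
true_mu = 1.5
x = rng.normal(loc=true_mu, scale=1.0, size=100_000)  # n observations from P(x | theta*)

def avg_neg_log_likelihood(mu, x):
    """(1/n) * sum_i -log N(x_i | mu, 1)."""
    return np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (x - mu) ** 2)

# Grid of candidate values for theta.
candidates = np.linspace(-3.0, 5.0, 801)
nll = np.array([avg_neg_log_likelihood(mu, x) for mu in candidates])

# The empirical NLL minimizer approaches the KL minimizer (the true mean) as n grows.
print(candidates[np.argmin(nll)])  # close to 1.5
```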