
Relation between Maximum Likelihood and KL-Divergence

Maximizing likelihood is equivalent to minimizing KL-Divergence. Considering $P(x \vert \theta^*)$ is the true distribution and $P(x \vert \hat{\theta})$ is our estimate, we already know that KL-Divergence is written as:

$$
\begin{aligned}
D_{KL}[P(x \vert \theta^*) \, \| \, P(x \vert \hat{\theta})]
&= \mathbb{E}_{x \sim P(x \vert \theta^*)}\left[\log \frac{P(x \vert \theta^*)}{P(x \vert \hat{\theta})}\right] \\
&= \mathbb{E}_{x \sim P(x \vert \theta^*)}\left[\log P(x \vert \theta^*) - \log P(x \vert \hat{\theta})\right] \\
&= \mathbb{E}_{x \sim P(x \vert \theta^*)}\left[\log P(x \vert \theta^*)\right] - \mathbb{E}_{x \sim P(x \vert \theta^*)}\left[\log P(x \vert \hat{\theta})\right]
\end{aligned}
$$
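As a quick sanity check of this decomposition, the short Python sketch below compares the direct definition against the right-hand side for two small categorical distributions; the probability vectors `p` and `q` are made up purely for illustration:

```python
# A minimal numerical check of the decomposition above, assuming two
# hypothetical categorical distributions p (true) and q (estimate).
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # true distribution P(x | theta*)
q = np.array([0.3, 0.4, 0.3])   # estimated distribution P(x | theta_hat)

# Direct definition: E_{x~p}[log p(x) - log q(x)]
kl_direct = np.sum(p * (np.log(p) - np.log(q)))

# Decomposition: E_{x~p}[log p(x)] - E_{x~p}[log q(x)]
kl_decomposed = np.sum(p * np.log(p)) - np.sum(p * np.log(q))

print(kl_direct, kl_decomposed)  # both print the same value (~0.0305)
```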

On the right-hand side of the equation, the first term is the negative entropy of the true distribution. It does not depend on the estimated parameter $\hat{\theta}$, so we can ignore it.

Suppose we have $n$ observations $x_i \sim P(x \vert \theta^*)$ drawn from the true distribution. Then, the Law of Large Numbers says that as $n$ goes to infinity,

$$
\frac{1}{n} \sum_{i=1}^{n} \log P(x_i \vert \hat{\theta}) \rightarrow \mathbb{E}_{x \sim P(x \vert \theta^*)}\left[\log P(x \vert \hat{\theta})\right]
$$
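To make this concrete, here is a small sketch of this Law of Large Numbers statement, assuming a hypothetical true distribution $N(0, 1)$ and estimate $N(0.5, 1)$, both chosen only for illustration:

```python
# Sample average of the log-likelihood vs. its exact expectation,
# for illustrative unit-variance Gaussians.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_true, theta_hat = 0.0, 0.5   # means of the true and estimated Gaussians

x = rng.normal(loc=theta_true, scale=1.0, size=1_000_000)  # x ~ P(x | theta*)

# Sample average of log P(x_i | theta_hat) ...
sample_avg = np.mean(norm.logpdf(x, loc=theta_hat, scale=1.0))

# ... versus the exact expectation E_{x~P(x|theta*)}[log P(x | theta_hat)],
# which for unit-variance Gaussians is:
exact = -0.5 * np.log(2 * np.pi) - 0.5 * (1.0 + (theta_true - theta_hat) ** 2)

print(sample_avg, exact)  # the two values closely agree
```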

which gives the second term of the KL-Divergence equation above. Notice that $-\sum_{i=1}^{n} \log P(x_i \vert \hat{\theta})$ is the negative log-likelihood of the data under our estimate. Then, minimizing $D_{KL}[P(x \vert \theta^*) \, \| \, P(x \vert \hat{\theta})]$ with respect to $\hat{\theta}$ is equivalent to minimizing the negative log-likelihood; in other words, it is equivalent to maximizing the log-likelihood. This is important because it gives MLE a nice interpretation: maximizing the likelihood of the data under our estimate is the same as minimizing the difference between our estimate and the real data distribution. We can see MLE as a proxy for fitting our estimate to the real distribution, which cannot be done directly as the real distribution is unknown to us.
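As a hedged illustration of this equivalence, the sketch below scans candidate values of $\hat{\theta}$ for the mean of a Gaussian with a made-up true distribution $N(2, 1)$: the value that maximizes the average log-likelihood of the samples coincides, up to sampling noise and grid resolution, with the value that minimizes the closed-form KL-Divergence $\frac{1}{2}(2 - \hat{\theta})^2$.

```python
# Maximizing log-likelihood vs. minimizing KL, on a hypothetical example.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
theta_true = 2.0
x = rng.normal(loc=theta_true, scale=1.0, size=10_000)    # observations

thetas = np.linspace(0.0, 4.0, 401)                        # candidate estimates

# Average log-likelihood of the data under each candidate theta
avg_loglik = np.array([norm.logpdf(x, loc=t, scale=1.0).mean() for t in thetas])

# Closed-form KL divergence between the true and candidate Gaussians
kl = 0.5 * (theta_true - thetas) ** 2

print(thetas[np.argmax(avg_loglik)])  # ~2.0 (up to sampling noise): the MLE
print(thetas[np.argmin(kl)])          #  2.0: the KL minimizer
```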