Assume that our data follow the true distribution g(y), and our statistical model to approximate g(y) is f(y).
Properties of the KL information number: (i) $I(g;f) \ge 0$; (ii) $I(g;f) = 0$ if and only if $g = f$.
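Pf. For simplicity, we only consider the discrete case here. Assume that $p = (p_1, \dots, p_k)$ and $q = (q_1, \dots, q_k)$ are probability distributions satisfying $p_i > 0$, $q_i > 0$ and $\sum_{i=1}^{k} p_i = \sum_{i=1}^{k} q_i = 1$, so that $I(p;q) = \sum_{i=1}^{k} p_i \log (p_i/q_i)$.
Let $h(x) = \log x - x + 1$ for $x > 0$. Since $h'(x) = 1/x - 1$, $h(x)$ attains its maximum value 0 only at $x = 1$. Thus $\log x \le x - 1$ and the equality holds only when $x = 1$. By putting $x = q_i/p_i$, we have
$$\log \frac{q_i}{p_i} \le \frac{q_i}{p_i} - 1.$$
It follows that
$$-I(p;q) = \sum_{i=1}^{k} p_i \log \frac{q_i}{p_i} \le \sum_{i=1}^{k} p_i \left( \frac{q_i}{p_i} - 1 \right) = \sum_{i=1}^{k} q_i - \sum_{i=1}^{k} p_i = 0.$$
The equality only holds when $p_i = q_i$ for all $i$. Q.E.D.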
Thus we know $I(g;f) \to 0$ as $f \to g$.
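As a numerical illustration of the discrete case treated in the proof, the following minimal sketch evaluates the KL information number for two small distributions. It assumes the standard discrete form $I(p;q) = \sum_i p_i \log (p_i/q_i)$; the function name `kl_information` and the example vectors `p` and `q` are hypothetical and chosen only for illustration.

```python
import math


def kl_information(p, q):
    """Return I(p;q) = sum_i p_i * log(p_i / q_i).

    Assumes p and q are strictly positive and each sums to 1
    (the same assumptions as in the discrete proof).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical example distributions on three points.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

i_pq = kl_information(p, q)  # strictly positive because p != q
i_pp = kl_information(p, p)  # zero because the two distributions coincide

print("I(p;q) =", i_pq)
print("I(p;p) =", i_pp)
```

The first value is strictly positive and the second is zero, in line with the two properties stated for the KL information number.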
It can be shown that -I(g;f) is the entropy. To minimize the KL information number is to maximize entropy.
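How to estimate I(g;f)? From (17) we have
$$I(g;f) = E_g[\log g(Y)] - E_g[\log f(Y)],$$
where the first term does not depend on the model $f$. For i.i.d. observations $y_1, \dots, y_n$ from $g$, the law of large numbers gives
$$\frac{1}{n} \sum_{i=1}^{n} \log f(y_i) \to E_g[\log f(Y)]$$
as $n \to \infty$. Therefore, the mean log-likelihood $\frac{1}{n} \sum_{i=1}^{n} \log f(y_i)$ can replace the KL information number as a criterion for evaluating models. One wants to find the greatest possible mean log-likelihood.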
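The idea of ranking models by their mean log-likelihood can be sketched with a small simulation. The setup below is an assumption made only for illustration: the true distribution $g$ is taken to be $N(0,1)$, and three hypothetical candidate normal models are compared on simulated data. The helper names `normal_logpdf` and `mean_log_likelihood` are likewise illustrative.

```python
import math
import random


def normal_logpdf(y, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at y."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)


def mean_log_likelihood(data, mu, sigma):
    """(1/n) * sum_i log f(y_i) for the model f = N(mu, sigma^2)."""
    return sum(normal_logpdf(y, mu, sigma) for y in data) / len(data)


# Assumed true distribution g = N(0, 1); fixed seed for reproducibility.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Hypothetical candidate models f, given as (mean, standard deviation).
candidates = {
    "N(0,1)": (0.0, 1.0),
    "N(0.5,1)": (0.5, 1.0),
    "N(0,4)": (0.0, 2.0),
}

scores = {name: mean_log_likelihood(data, mu, sigma)
          for name, (mu, sigma) in candidates.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(name, score)
print("largest mean log-likelihood:", best)
```

For these three candidates the exact values of $I(g;f)$ are $0$, $0.125$ and about $0.318$, so the model with the largest mean log-likelihood is also the one with the smallest KL information number.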
Example: Consider , i.e.,