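Assume that our data follow the true distribution $g(y)$, and that our statistical model approximating $g(y)$ is $f(y)$. The Kullback-Leibler (KL) information number of the model $f$ with respect to $g$ is

$$I(g;f) = E_g\!\left[\log\frac{g(Y)}{f(Y)}\right] = \int g(y)\,\log\frac{g(y)}{f(y)}\,dy. \tag{17}$$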
Properties of the KL information number: (i) $I(g;f) \ge 0$; (ii) $I(g;f) = 0$ if and only if $g = f$.
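Pf. For simplicity, we only consider the discrete case here. Assume that $p = (p_1, \dots, p_k)$ and $q = (q_1, \dots, q_k)$ are probability distributions satisfying $p_i > 0$, $q_i > 0$ and $\sum_i p_i = \sum_i q_i = 1$. Let $h(x) = \log x - x + 1$ for $x > 0$. Since $h'(x) = 1/x - 1$, the function $h(x)$ attains its maximum value 0 only at $x = 1$. Thus $\log x \le x - 1$, and the equality holds only when $x = 1$. By putting $x = q_i/p_i$, we have

$$\sum_i p_i \log\frac{q_i}{p_i} \le \sum_i p_i\left(\frac{q_i}{p_i} - 1\right) = \sum_i q_i - \sum_i p_i = 0, \quad\text{i.e.,}\quad I(p;q) = \sum_i p_i \log\frac{p_i}{q_i} \ge 0.$$

The equality only holds when $p_i = q_i$ for all $i$. Q.E.D.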
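The two properties (non-negativity, and a value of zero only when the two distributions coincide) can be checked numerically in the discrete case. The following is a minimal sketch, assuming the discrete form $\sum_i p_i \log(p_i/q_i)$ of the KL information number; the function name `kl_information` and the example probabilities are illustrative and not taken from the notes.

```python
import math

def kl_information(p, q):
    """Discrete KL information number I(p; q) = sum_i p_i * log(p_i / q_i).

    Assumes p and q are equal-length sequences of strictly positive
    probabilities, each summing to 1 (the setting used in the proof).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Illustrative distributions (assumed values, chosen only for the demo).
p = [0.5, 0.3, 0.2]   # plays the role of the true distribution
q = [0.4, 0.4, 0.2]   # plays the role of the model

i_pq = kl_information(p, q)   # strictly positive because p != q
i_pp = kl_information(p, p)   # zero because every term is p_i * log(1)

print("I(p;q) =", i_pq)
print("I(p;p) =", i_pp)
```

Note that the quantity is not symmetric: `kl_information(q, p)` generally differs from `kl_information(p, q)`, although both are non-negative.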
Thus we know $I(g;f) \to 0$ as $f \to g$.
It can be shown that $-I(g;f)$ is the entropy, so minimizing the KL information number is equivalent to maximizing the entropy.
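How to estimate $I(g;f)$? From (17) we have $I(g;f) = E_g \log g(Y) - E_g \log f(Y)$, where the first term does not depend on the model. For independent observations $y_1, \dots, y_n$ from $g$, the law of large numbers gives

$$\frac{1}{n}\sum_{i=1}^{n} \log f(y_i) \;\longrightarrow\; E_g \log f(Y) \tag{20}$$

as $n \to \infty$. Therefore, the log-likelihood $\sum_{i=1}^{n} \log f(y_i)$ can replace the KL information number as a criterion for evaluating models. One wants to find the greatest possible log-likelihood.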
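The idea that an averaged log-likelihood stands in for the unknown expectation $E_g \log f(Y)$ can be illustrated by simulation. The sketch below assumes, purely for illustration, that the true distribution is $N(0,1)$ and that two candidate models are normal with fixed parameters; all function names, the seed, and the model parameters are hypothetical choices, not part of the notes.

```python
import math
import random

def normal_logpdf(y, mean, sd):
    """Log density of N(mean, sd^2) at y."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (y - mean) ** 2 / (2 * sd * sd)

def mean_log_likelihood(data, mean, sd):
    """Sample average of log f(y_i) under the model N(mean, sd^2)."""
    return sum(normal_logpdf(y, mean, sd) for y in data) / len(data)

def expected_log_f(mean, sd):
    """Exact E_g log f(Y) for g = N(0, 1) and f = N(mean, sd^2).

    Uses E_g (Y - mean)^2 = 1 + mean^2 under the assumed true distribution.
    """
    return -0.5 * math.log(2 * math.pi * sd * sd) - (1 + mean ** 2) / (2 * sd * sd)

rng = random.Random(0)                      # fixed seed for reproducibility
data = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# Two hypothetical candidate models, given as (mean, sd).
models = {"A": (0.0, 1.5), "B": (1.0, 1.0)}

e_log_g = expected_log_f(0.0, 1.0)          # E_g log g(Y), the model-free term
results = {}
for name, (m, s) in models.items():
    exact = expected_log_f(m, s)
    results[name] = {
        "mll": mean_log_likelihood(data, m, s),  # estimate from the sample
        "exact": exact,                          # limit as n grows
        "kl": e_log_g - exact,                   # I(g; f) for this model
    }
    print(name, results[name])
```

Under these assumptions the model with the larger mean log-likelihood is also the one with the smaller KL information number, which is why the log-likelihood can be used to rank models without knowing $g$.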
![]() |
(21) |
![]() |
(22) |
Example: Consider
, i.e.,
| (23) |
| (24) |