Assume that our data follow the true distribution *g*(*y*), and our statistical model to approximate *g*(*y*) is *f*(*y*).

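The Kullback-Leibler (KL) information number of *f*(*y*) with respect to *g*(*y*) is defined as

$$I(g;f) = E_Y \log\frac{g(Y)}{f(Y)} = \int \log\left\{\frac{g(y)}{f(y)}\right\} g(y)\,dy \qquad (17)$$

where $E_Y$ denotes expectation with respect to the true distribution *g*(*y*).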

Properties of the KL information number:

(i) $I(g;f) \ge 0$;

(ii) $I(g;f) = 0$ if and only if $g(y) = f(y)$.

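*Pf*. For simplicity, we only consider the discrete case here. Assume that *p* = $\{p_1, \dots, p_k\}$ and *q* = $\{q_1, \dots, q_k\}$ are probability distributions satisfying $p_i > 0$, $q_i > 0$ and $\sum_{i} p_i = \sum_{i} q_i = 1$.

Let $h(x) = \log x - x + 1$ for $x > 0$. $h(x)$ attains its maximum value 0 only at $x = 1$. Thus $\log x \le x - 1$ and the equality holds only when $x = 1$. By putting $x = q_i/p_i$, we have

$$\log\frac{q_i}{p_i} \le \frac{q_i}{p_i} - 1 .$$

Multiplying both sides by $p_i$ and summing over $i$ gives

$$\sum_{i} p_i \log\frac{q_i}{p_i} \le \sum_{i} (q_i - p_i) = 0 .$$

It follows that

$$I(p;q) = \sum_{i} p_i \log\frac{p_i}{q_i} \ge 0 .$$

The equality only holds when $p_i = q_i$ for all $i$. Q.E.D.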
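The non-negativity property can be checked numerically in the discrete case. The following is a minimal sketch, not part of the original notes: the function name `kl_information` and the two example distributions are illustrative assumptions, and both distributions are assumed to have strictly positive probabilities that sum to one.

```python
import math

def kl_information(p, q):
    """Discrete KL information number I(p;q) = sum_i p_i * log(p_i / q_i).

    Assumes p and q have the same length, strictly positive entries,
    and each sums to one (hypothetical helper for illustration only).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Two illustrative distributions on three points (assumed values).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

i_pq = kl_information(p, q)  # positive, because p differs from q
i_pp = kl_information(p, p)  # zero, because the two distributions coincide

print("I(p;q) =", i_pq)
print("I(p;p) =", i_pp)
```

Note that the measure is not symmetric: `kl_information(q, p)` generally differs from `kl_information(p, q)`, although both are non-negative.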

Thus we know as .

It can be shown that -*I*(*g*;*f*) is the *entropy*. To minimize the KL information number is to maximize entropy.

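How to estimate *I*(*g*;*f*)? From (17) we have

$$I(g;f) = E_Y \log g(Y) - E_Y \log f(Y),$$

where the first term does not depend on the model *f*. Only the expected log-likelihood $E_Y \log f(Y)$ matters when comparing models. Given independent observations $y_1, \dots, y_n$ from *g*(*y*), the law of large numbers gives

$$\frac{1}{n}\sum_{i=1}^{n} \log f(y_i) \longrightarrow E_Y \log f(Y) \qquad (20)$$

as $n \to \infty$. Therefore, the mean log-likelihood $\frac{1}{n}\sum_{i=1}^{n} \log f(y_i)$ can replace the KL information number as a criterion for evaluating models. One wants to find the greatest possible value of this quantity, since a larger expected log-likelihood corresponds to a smaller *I*(*g*;*f*).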
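The convergence of the sample mean of log *f*(*y*) to its expectation can be illustrated by simulation. The sketch below is an illustration under stated assumptions, not the notes' own example: the true distribution is taken to be N(0, 1), the three candidate models are arbitrary normal distributions, and the helper names are hypothetical. For a normal model with mean μ and standard deviation σ, the expected log-likelihood under N(0, 1) has the closed form −½ log(2πσ²) − (1 + μ²)/(2σ²), which the code uses as the reference value.

```python
import math
import random

def normal_logpdf(y, mu, sigma):
    """Log density of N(mu, sigma^2) at y."""
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - (y - mu) ** 2 / (2.0 * sigma**2)

def mean_log_likelihood(ys, mu, sigma):
    """Sample mean of log f(y_i) for the model f = N(mu, sigma^2)."""
    return sum(normal_logpdf(y, mu, sigma) for y in ys) / len(ys)

def expected_log_likelihood(mu, sigma):
    """Closed form of E_Y log f(Y) when Y ~ N(0, 1) and f = N(mu, sigma^2)."""
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - (1.0 + mu**2) / (2.0 * sigma**2)

# Simulated data from the assumed true distribution g = N(0, 1).
rng = random.Random(0)
ys = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Candidate models (assumed for illustration): name -> (mu, sigma).
models = {"N(0,1)": (0.0, 1.0), "N(1,1)": (1.0, 1.0), "N(0,4)": (0.0, 2.0)}

results = {
    name: (mean_log_likelihood(ys, mu, sigma), expected_log_likelihood(mu, sigma))
    for name, (mu, sigma) in models.items()
}
best = max(results, key=lambda name: results[name][0])

for name, (sample_mean, expected) in results.items():
    print(f"{name}: mean log-likelihood {sample_mean:.4f}, expected {expected:.4f}")
print("model with the greatest mean log-likelihood:", best)
```

With a sample this large, the model that coincides with the true distribution should obtain the greatest mean log-likelihood, which is the same ranking the KL information number would give.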

**Log-likelihood**:

$$\ell = \sum_{i=1}^{n} \log f(y_i) \qquad (21)$$

**Likelihood**:

$$L = \prod_{i=1}^{n} f(y_i) \qquad (22)$$
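The log-likelihood is the sum of the log densities of the observations and the likelihood is the product of the densities, so the former is the logarithm of the latter. The sketch below is an illustration only, assuming a standard normal model for *f* and simulated data, with hypothetical helper names. It also shows one practical reason to work on the log scale: for large *n* the product of densities underflows in floating point while the sum of logs stays finite.

```python
import math
import random

def normal_logpdf(y, mu=0.0, sigma=1.0):
    """Log density of N(mu, sigma^2) at y (assumed model for illustration)."""
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - (y - mu) ** 2 / (2.0 * sigma**2)

def log_likelihood(ys):
    """Sum of log f(y_i) over the observations."""
    return sum(normal_logpdf(y) for y in ys)

def likelihood(ys):
    """Product of f(y_i) over the observations."""
    return math.prod(math.exp(normal_logpdf(y)) for y in ys)

rng = random.Random(1)
small = [rng.gauss(0.0, 1.0) for _ in range(5)]
large = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# For a small sample the two quantities agree up to rounding error.
small_gap = abs(math.log(likelihood(small)) - log_likelihood(small))
print("difference between log L and the log-likelihood (n = 5):", small_gap)

# For a large sample the product underflows to zero, the sum does not.
print("likelihood (n = 2000):", likelihood(large))
print("log-likelihood (n = 2000):", log_likelihood(large))
```

Since the logarithm is monotone increasing, maximizing the log-likelihood and maximizing the likelihood select the same model.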

**Example:** Consider , i.e.,

(23) |

(24) |