The really interesting thing is the difference between the cross-entropy and the entropy. That difference is how much longer our messages are because we used a code optimized for a different distribution. If the two distributions are the same, this difference is zero; the more they differ, the bigger it gets.
We call this difference the Kullback–Leibler divergence, or just the KL divergence. The KL divergence of \(p\) with respect to \(q\), \(D_{q}(p)\), is defined as
\begin{equation} D_{q}(p) = H_{q}(p) - H(p). \end{equation}
The really neat thing about KL divergence is that it acts like a distance between two distributions. It measures how different they are! (It isn't a true distance, since it isn't symmetric, but if you take the idea seriously, you end up with information geometry.)
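To make the definition concrete, here is a minimal sketch in Python (using numpy, with two made-up distributions \(p\) and \(q\)) that computes the entropy, the cross-entropy, and their difference straight from the formula above. The distributions and names are illustrative, not from any particular dataset.

```python
import numpy as np

# Two toy distributions over the same four events.
p = np.array([0.5, 0.25, 0.125, 0.125])   # the distribution events actually come from
q = np.array([0.25, 0.25, 0.25, 0.25])    # the distribution the code was optimized for

def entropy(p):
    """H(p): average message length with a code optimized for p (in bits)."""
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H_q(p): average length when events come from p but the code is built for q."""
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    """D_q(p) = H_q(p) - H(p): the extra message length we pay."""
    return cross_entropy(p, q) - entropy(p)

print(entropy(p))           # 1.75 bits
print(cross_entropy(p, q))  # 2.0 bits
print(kl_divergence(p, q))  # 0.25 bits of overhead
```

Note that swapping the roles of \(p\) and \(q\) gives a different answer, which is exactly the asymmetry mentioned above.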
Cross-entropy and KL divergence are incredibly useful in machine learning. Often, we want one distribution to be close to another. For example, we might want a predicted distribution to be close to the ground truth. KL divergence gives us a natural way to measure how far off we are, and so it shows up everywhere.
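As a rough illustration of how this is used as a loss, here is a hedged sketch with made-up numbers: a classifier predicts a distribution over three classes and we compare it to a one-hot ground truth. Because the entropy of a one-hot target is zero, the cross-entropy equals the KL divergence in this case, so minimizing one minimizes the other.

```python
import numpy as np

# Hypothetical example: a model predicts a distribution over 3 classes;
# the ground truth is class 1, written as a one-hot "distribution".
predicted = np.array([0.7, 0.2, 0.1])
truth     = np.array([0.0, 1.0, 0.0])

# Cross-entropy loss H_predicted(truth). Since H(truth) = 0 for a one-hot
# target, this is also D_predicted(truth), the KL divergence.
eps = 1e-12  # guard against log(0)
loss = -np.sum(truth * np.log2(predicted + eps))
print(loss)  # ~2.32 bits: the prediction puts only 0.2 on the true class
```

A better prediction (more mass on the true class) drives this number toward zero, which is why the cross-entropy/KL loss is such a natural training objective.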