
The Benefits of Cross Entropy Loss

$$ \newcommand{\argmax}{\mathop{\mathrm{argmax}}} \newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\vect}[1]{ \boldsymbol{#1} } \newcommand{\batch}[1]{ ^{({#1})} } \newcommand{\grad}[1]{ \nabla#1 } \newcommand{\gradWrt}[2]{ \nabla_{#2}#1 } \newcommand{\gradDir}[1]{ \frac{ \grad{#1} }{ \| \grad{#1} \|} } \newcommand{\gradDirWrt}[2]{ \frac{ \gradWrt{#1}{#2} }{\| \gradWrt{#1}{#2} \|} } \newcommand{\partialD}[2]{ \frac{ \partial#1 }{ \partial#2 } } \newcommand{\partialDTwo}[3]{ \frac{ \partial#1 }{ \partial#2\partial#3 } } \newcommand{\derivativeWrt}[2]{ \frac{ d#1 }{ d#2 } } \newcommand{\L}{ \mathcal{L} } \newcommand{\P}{ P } \newcommand{\D}{ D } \newcommand{\R}{ \mathbb{R} } \newcommand{\H}{ \boldsymbol{H} } \newcommand{\y}{ \vect{y} } \newcommand{\x}{ \vect{x} } \newcommand{\model}{ f(\x,\theta) } $$

Cross entropy loss is almost always used for classification problems in machine learning. I thought it would be interesting to look into the theory and reasoning behind its wide usage. Not as much as I expected was written on the subject, but from what little I could find I learned a few interesting things. This post is more about explaining the justification and benefits of cross entropy loss than about explaining what cross entropy actually is. If you don’t know what cross entropy is, there are many great sources on the internet that will explain it much better than I ever could, so please learn about cross entropy before continuing.

Theoretical Justification

It’s important to know why cross entropy makes sense as a loss function. Under the framework of maximum likelihood estimation, the goal of machine learning is to maximize the likelihood of our parameters given our data, which is equivalent to maximizing the probability of our data given our parameters:

$$ \argmax_{\theta} \L(\theta \mid \D) = \argmax_{\theta} \P(\D \mid \theta) $$

where $\D$ is our dataset (a set of pairs of input and target vectors $\x$ and $\y$) and $\theta$ is our model parameters.

Since the dataset contains many examples, the conditional probability can be rewritten as a joint probability of per-example probabilities. Note that we are assuming that our data is independent and identically distributed. This assumption allows us to compute the joint probability by simply multiplying the per-example conditional probabilities:

$$ \P(\D \mid \theta) = \prod_{i=1}^{N} \P(\y\batch{i} \mid \x\batch{i}, \theta) $$

And since the logarithm is a monotonic function, maximizing the likelihood is equivalent to minimizing the negative log-likelihood of our parameters given our data:

$$ \argmax_{\theta} \prod_{i=1}^{N} \P(\y\batch{i} \mid \x\batch{i}, \theta) = \argmin_{\theta} \left( -\sum_{i=1}^{N} \log \P(\y\batch{i} \mid \x\batch{i}, \theta) \right) $$

In classification models, the output vector is often interpreted as a categorical probability distribution, and thus we have:

$$ \P(\y \mid \x, \theta) = \hat{y}_k $$

where $\hat{\y} = \model$ is the model output and $k$ is the index of the correct category.

Notice that the cross entropy of the output vector is equal to $-\log \hat{y}_k$ because our “true” distribution is a one-hot vector:

$$ H(\y, \hat{\y}) = -\sum_{j} y_j \log \hat{y}_j = -\log \hat{y}_k $$

where $\y$ is the one-hot encoded target vector.
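
To make this concrete, here’s a minimal NumPy sketch (with made-up probabilities) showing that the cross entropy against a one-hot target collapses to the negative log-probability of the correct class:

```python
import numpy as np

# Hypothetical model output over 4 classes and a one-hot target for class k = 2.
y_hat = np.array([0.1, 0.2, 0.6, 0.1])   # model's categorical distribution
y = np.array([0.0, 0.0, 1.0, 0.0])       # one-hot "true" distribution

cross_entropy = -np.sum(y * np.log(y_hat))   # H(y, y_hat)
neg_log_prob = -np.log(y_hat[2])             # -log y_hat_k

print(cross_entropy, neg_log_prob)           # both ~0.5108
np.testing.assert_allclose(cross_entropy, neg_log_prob)
```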

So in total we have:

$$ \argmax_{\theta} \P(\D \mid \theta) = \argmin_{\theta} \left( -\sum_{i=1}^{N} \log \hat{y}_k\batch{i} \right) = \argmin_{\theta} \sum_{i=1}^{N} H(\y\batch{i}, \hat{\y}\batch{i}) $$

Thus we have shown that maximizing the likelihood of a classification model is equivalent to minimizing the cross entropy between the model’s categorical output vector and the one-hot target, so cross entropy loss has a valid theoretical justification.
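
As a sanity check, here’s a small sketch (again with arbitrary values) verifying that the negative log-likelihood over a dataset is identical to the sum of per-example cross entropies against one-hot targets:

```python
import numpy as np

rng = np.random.default_rng(0)

N, C = 5, 3                                                          # 5 examples, 3 classes
logits = rng.normal(size=(N, C))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # model outputs
targets = rng.integers(0, C, size=N)                                 # correct class indices

# Negative log-likelihood: -sum_i log P(y_i | x_i, theta)
nll = -np.sum(np.log(probs[np.arange(N), targets]))

# Sum of per-example cross entropies against one-hot targets
one_hot = np.eye(C)[targets]
ce_sum = -np.sum(one_hot * np.log(probs))

np.testing.assert_allclose(nll, ce_sum)   # identical, as derived above
```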

Numerical Stability

One thing you might ask is: given that the logarithm is monotonic, what difference does it make to use log-probabilities instead of the probabilities themselves? One of the main reasons is numerical stability. As demonstrated in the section above, computing the likelihood of the model requires a joint probability over the dataset examples, which means multiplying all the per-example probabilities together:

$$ \P(\D \mid \theta) = \prod_{i=1}^{N} \P(\y\batch{i} \mid \x\batch{i}, \theta) $$

This product becomes very tiny. Consider the case where each probability is around 0.01 (e.g. trying to predict one class out of 100) and we are using a batch size of 128. The joint probability would be around $0.01^{128} = 10^{-256}$, which is far below the smallest representable single-precision float and therefore underflows to zero (a slightly larger batch would underflow double precision as well).
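
Here’s a quick sketch of the underflow in action, multiplying 128 probabilities of 0.01 in single precision:

```python
import numpy as np

probs = np.full(128, 0.01, dtype=np.float32)

joint = np.prod(probs)   # 0.01**128 = 1e-256, far below float32's smallest value
print(joint)             # 0.0 -- the likelihood has underflowed to zero
```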

However, this issue can be avoided with log-probabilities: by the product rule of logarithms, the product of probabilities inside the logarithm becomes a sum of logarithms:

$$ \log \prod_{i=1}^{N} \P(\y\batch{i} \mid \x\batch{i}, \theta) = \sum_{i=1}^{N} \log \P(\y\batch{i} \mid \x\batch{i}, \theta) $$

Using log-probabilities keeps the values in a reasonable range. It also keeps the gradient computation simple, because it is easier to differentiate and accumulate a sum of terms than a product of them.
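
And the same computation done in log space, which stays comfortably within floating point range:

```python
import numpy as np

log_probs = np.log(np.full(128, 0.01, dtype=np.float32))

log_joint = np.sum(log_probs)   # sum of logs instead of a product of probabilities
print(log_joint)                # ~ -589.5, a perfectly representable number
```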

Well-behaved Gradients

Using log-probabilities has the additional effect of keeping gradients from varying too widely. Many probability distributions we deal with in machine learning belong to the exponential family. Take, for example, a normal distribution:

$$ \P(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) $$

Notice what happens when we turn this into a negative log-probability and take the derivative with respect to $\mu$:

$$ -\log \P(x \mid \mu, \sigma) = \log\left( \sigma \sqrt{2\pi} \right) + \frac{(x - \mu)^2}{2\sigma^2} $$

$$ \partialD{}{\mu} \left[ -\log \P(x \mid \mu, \sigma) \right] = \frac{\mu - x}{\sigma^2} $$

Notice that the derivative gives us exactly the right value ($\mu = x$) after a single gradient descent update with learning rate 1, assuming $\sigma = 1$: the exponential vanishes under the logarithm and we are left with a simple quadratic.

Obviously this is just a toy example, but other probability distributions in the exponential family behave similarly, since their log-probabilities are at most polynomial in the parameters. This keeps the gradient from varying wildly and means that the same learning rate should give us consistent step sizes.
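
Here’s a tiny numeric sketch of the toy example above, checking that a single gradient step with learning rate 1 (and $\sigma = 1$) lands exactly on the maximum likelihood estimate:

```python
import numpy as np

x = 3.7        # observed value
mu = 0.0       # current parameter estimate
sigma = 1.0

# Gradient of the negative log-probability w.r.t. mu: (mu - x) / sigma**2
grad_mu = (mu - x) / sigma**2

mu_new = mu - 1.0 * grad_mu   # gradient descent step with learning rate 1
print(mu_new)                 # 3.7 -- exactly the maximum likelihood estimate (mu = x)
```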

Also consider the common case of applying a softmax to the logits $\vect{z}$ before the cross entropy loss:

$$ \hat{y}_j = \text{softmax}(\vect{z})_j = \frac{e^{z_j}}{\sum_{m} e^{z_m}} $$

Recall that the cross entropy loss is given by:

$$ \L = -\sum_{j} y_j \log \hat{y}_j = -\log \hat{y}_k $$

Now try computing the gradient of the cross entropy loss with respect to the logits $\vect{z}$:

$$ \partialD{\L}{z_j} = \hat{y}_j - y_j $$

As we can see, this gives us a well-behaved gradient whose components are bounded by $1$ in absolute value, since both $\hat{y}_j$ and $y_j$ lie in $[0, 1]$.
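
To double-check, here’s a short sketch comparing the analytic gradient $\hat{y}_j - y_j$ against finite differences for some made-up logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, k):
    return -np.log(softmax(z)[k])       # loss for correct class index k

z = np.array([2.0, -1.0, 0.5])          # logits
k = 0                                   # correct class
y = np.eye(len(z))[k]                   # one-hot target

analytic = softmax(z) - y               # dL/dz = y_hat - y

# Central finite differences as a check
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * d, k) - cross_entropy(z - eps * d, k)) / (2 * eps)
    for d in np.eye(len(z))
])

print(analytic)                          # each component lies in [-1, 1]
np.testing.assert_allclose(analytic, numeric, atol=1e-5)
```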

References

  1. Morgan Giraud. ML notes: Why the log-likelihood?
  2. Rob DiPietro. A Friendly Introduction to Cross-Entropy Loss