Binary cross entropy derivative
Where D is also an arbitrary constant. A good choice is the maximum between all inputs, negated:

$$D = -\max_k a_k$$

This will shift the inputs to a range close to zero, assuming the inputs themselves are not too far from each other. It also makes every shifted input non-positive, with the largest one becoming exactly zero. Exponentials of large negative numbers "saturate" to zero rather than infinity, so we have a better chance of avoiding NaNs.
Note that this is still imperfect, since mathematically softmax would never really produce a zero, but this is much better than NaNs, and since the distance between the inputs is very large it's expected to get a result extremely close to zero anyway.
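As a minimal sketch of the shift described above, here is a pure-Python softmax that uses D = -max(a). The function name `stable_softmax` and the sample inputs are made up for illustration. Note that with the standard `math` module an unshifted `math.exp(3000.0)` raises `OverflowError` instead of producing infinity, so the shift is what makes this call succeed at all.

```python
import math

def stable_softmax(a):
    """Softmax of a list of floats, shifted by D = -max(a) before exponentiating."""
    d = -max(a)                            # the constant D described above
    exps = [math.exp(v + d) for v in a]    # every exponent is <= 0, so exp() cannot overflow
    total = sum(exps)                      # at least 1.0: the largest input contributes exp(0)
    return [e / total for e in exps]

# Inputs that are far apart: the smaller ones underflow to exactly zero.
probs = stable_softmax([1000.0, 2000.0, 3000.0])
print(probs)  # [0.0, 0.0, 1.0]
```

The zeros in the output are the imperfection mentioned above: mathematically the values are tiny but positive.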
A common use of softmax appears in machine learning, in particular in logistic regression. Here we have an input x with N features, and T possible output classes. The weight matrix W is used to transform x into a vector with T elements (called "logits" in ML folklore), and the softmax function is used to "collapse" the logits into a vector of probabilities denoting the probability of x belonging to each one of the T output classes.
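The forward pass of such a layer can be sketched in a few lines. This is only an illustration under assumptions: W is stored as a list of T rows with N weights each, and the names `softmax_layer`, `W` and `x` as well as the numbers are invented for the example.

```python
import math

def softmax(a):
    """Numerically stable softmax of a list of floats."""
    d = -max(a)
    exps = [math.exp(v + d) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_layer(W, x):
    """W has T rows of N weights and x has N features; returns T class probabilities."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]  # the product W x
    return softmax(logits)

W = [[0.1, -0.2, 0.3],
     [0.0, 0.5, -0.1]]     # T = 2 classes, N = 3 features (made-up values)
x = [1.0, 2.0, 3.0]
P = softmax_layer(W, x)    # logits are roughly [0.6, 0.7]
print(P)
```

The second class gets the larger probability because its logit is larger, and the two probabilities sum to one.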
How do we compute the derivative of this "softmax layer" (fully-connected matrix multiplication followed by softmax)? Using the chain rule, of course! You'll find any number of derivations of this derivative online, but I want to approach it from first principles, by carefully applying the multivariate chain rule to the Jacobians of the functions involved.
An important point before we get started: in machine learning we usually want to find the best weight matrix W, and thus it is W we want to update with every step of gradient descent. Therefore, we'll be computing the derivative of this layer with respect to W. Let's start by rewriting the layer as a composition of vector functions.
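First, we have the matrix multiplication, which we denote g(W). It maps the NT weights to the T logits:

$$g(W) = Wx, \qquad g: \mathbb{R}^{NT} \to \mathbb{R}^{T}$$

Next we have the softmax, S(a), which maps the T logits to T probabilities. Overall, we have the function composition:

$$P(W) = S(g(W)) = (S \circ g)(W)$$

By applying the multivariate chain rule, the Jacobian of P(W) is:

$$DP(W) = D(S \circ g)(W) = DS(g(W)) \cdot Dg(W)$$

We've computed the Jacobian of S(a) earlier in this post; what's remaining is the Jacobian of g(W). Since g is a very simple function, computing its Jacobian is easy; the only complication is dealing with the indices correctly.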
We have to keep track of which weight each derivative is for. In a sense, the weight matrix W is "linearized" to a vector of length NT. If you're familiar with the memory layout of multi-dimensional arrays, it should be easy to understand how it's done. In our case, one simple thing we can do is linearize it in row-major order, where the first row is consecutive, followed by the second row, etc.
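To make the row-major linearization concrete, here is a small sketch. It assumes W is a list of T rows with N weights each and uses 0-based indices, so the weight in row i and column j lands at position i*N + j of the flat vector. The helper names are hypothetical.

```python
def flatten_row_major(W):
    """Concatenate the rows of W (a list of T rows, each with N weights)."""
    return [w for row in W for w in row]

def flat_index(i, j, N):
    """Position of W[i][j] in the flattened vector, using 0-based i and j."""
    return i * N + j

W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # T = 2 rows, N = 3 columns (made-up values)
flat = flatten_row_major(W)    # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# The weight in row 1, column 2 sits at position 1*3 + 2 = 5.
print(flat[flat_index(1, 2, N=3)])  # 6.0
```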
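Looking at it differently, if we split the index of W to i and j, we get:

$$D_{ij} g_t = \frac{\partial (W_{t1} x_1 + W_{t2} x_2 + \cdots + W_{tN} x_N)}{\partial W_{ij}} = \begin{cases} x_j & i = t \\ 0 & i \ne t \end{cases}$$

Finally, to compute the full Jacobian of the softmax layer, we just do a dot product between DS and Dg. Note that P(W) maps $\mathbb{R}^{NT}$ to $\mathbb{R}^{T}$, so the dimensions work out: DS is TxT and Dg is TxNT, so their product DP is TxNT.

In literature you'll see a much shortened derivation of the derivative of the softmax layer. That's fine, since the two functions involved are simple and well known. If we carefully compute a dot product between a row in DS and a column in Dg:

$$D_{ij} P_t = \sum_{k=1}^{T} D_k S_t \cdot D_{ij} g_k$$

Dg is mostly zeros, so the end result is simpler. The only k for which $D_{ij} g_k$ is nonzero is k = i, where it equals $x_j$. Using the softmax Jacobian $D_i S_t = S_t(\delta_{ti} - S_i)$, where $\delta_{ti}$ is 1 when t = i and 0 otherwise:

$$D_{ij} P_t = D_i S_t \, x_j = S_t (\delta_{ti} - S_i) \, x_j$$

So it's entirely possible to compute the derivative of the softmax layer without actual Jacobian matrix multiplication; and that's good, because matrix multiplication is expensive! The reason we can avoid most computation is that the Jacobian of the fully-connected layer is sparse.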
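The shortcut can be checked against the full Jacobian product. The sketch below is an illustration under assumptions: it takes the softmax Jacobian to be DS[t][k] = S_t (delta_tk - S_k), builds Dg in row-major order with 0-based indices, and compares the product DS · Dg with the closed form S_t (delta_ti - S_i) x_j. All function names and numbers are made up.

```python
import math

def softmax(a):
    d = -max(a)
    exps = [math.exp(v + d) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(S):
    """DS as a T x T matrix: DS[t][k] = S_t * (delta_tk - S_k)."""
    T = len(S)
    return [[S[t] * ((1.0 if t == k else 0.0) - S[k]) for k in range(T)]
            for t in range(T)]

def linear_jacobian(x, T):
    """Dg as a T x NT matrix: row t holds x_j in column t*N + j, zeros elsewhere."""
    N = len(x)
    Dg = [[0.0] * (N * T) for _ in range(T)]
    for t in range(T):
        for j in range(N):
            Dg[t][t * N + j] = x[j]
    return Dg

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def layer_jacobian_closed_form(S, x):
    """DP as a T x NT matrix with no matrix product: S_t * (delta_ti - S_i) * x_j."""
    T = len(S)
    return [[S[t] * ((1.0 if t == i else 0.0) - S[i]) * xj
             for i in range(T) for xj in x]
            for t in range(T)]

W = [[0.1, -0.2, 0.3],
     [0.0, 0.5, -0.1]]     # T = 2, N = 3 (made-up values)
x = [1.0, 2.0, 3.0]
S = softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

DP_chain = matmul(softmax_jacobian(S), linear_jacobian(x, T=len(W)))
DP_closed = layer_jacobian_closed_form(S, x)
max_diff = max(abs(p - q)
               for row_p, row_q in zip(DP_chain, DP_closed)
               for p, q in zip(row_p, row_q))
print(max_diff)  # expected to be zero up to floating-point rounding
```

The closed form touches each of the T·NT entries once, while the explicit product spends most of its time multiplying by the zeros in Dg.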
That said, I still felt it's important to show how this derivative comes to life from first principles based on the composition of Jacobians for the functions involved. The advantage of this approach is that it works exactly the same for more complex compositions of functions, where the "closed form" of the derivative for each element is much harder to compute otherwise.

We've just seen how the softmax function is used as part of a machine learning network, and how to compute its derivative using the multivariate chain rule. While we're at it, it's worth taking a look at a loss function that's commonly used along with softmax for training a network: cross-entropy. Cross-entropy has an interesting probabilistic and information-theoretic interpretation, but here I'll just focus on the mechanics.
For two discrete probability distributions p and q, the cross-entropy function is defined as:

$$\mathit{xent}(p, q) = -\sum_{k} p(k) \log(q(k))$$

where k goes over all the possible values of the random variable the distributions are defined for.
Specifically, in our case there are T output classes, so k would go from 1 to T. The softmax output P is one probability distribution. The other probability distribution is the "correct" classification output, usually denoted by Y. This is a one-hot encoded vector of size T, where all elements except one are 0.
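Let's rephrase the cross-entropy loss formula for our domain:

$$\mathit{xent}(Y, P) = -\sum_{k=1}^{T} Y(k) \log(P(k))$$

P(k) is the probability of the class as predicted by the model. Y(k) is the "true" probability of the class as provided by the data. Since Y is one-hot, let y be the sole index where Y(y) = 1. Every other term of the sum vanishes, and the formula simplifies to:

$$\mathit{xent}(Y, P) = -\log(P(y))$$

Actually, let's make it a function of just P, treating y as a constant, and write P(y) as the y-th element of the vector P, or $P_y$:

$$\mathit{xent}(P) = -\log(P_y)$$

The Jacobian of xent is a 1xT matrix (a row vector), since the output is a scalar and we have T inputs (the vector P has T elements):

$$D\mathit{xent} = \begin{bmatrix} D_1 \mathit{xent} & D_2 \mathit{xent} & \cdots & D_T \mathit{xent} \end{bmatrix}$$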
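For a one-hot Y the full cross-entropy sum collapses to a single term, the negative log of the probability assigned to the true class. A minimal numeric sketch of that claim follows. The probabilities are made up, and skipping terms where p(k) is zero is a convention assumed here (0 · log q is taken to be 0).

```python
import math

def cross_entropy(p, q):
    """xent(p, q) = -sum_k p(k) * log(q(k)); terms with p(k) == 0 are skipped."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0.0)

P = [0.1, 0.7, 0.2]     # hypothetical softmax output for T = 3 classes
Y = [0.0, 1.0, 0.0]     # one-hot "correct" distribution
y = Y.index(1.0)        # index of the true class (0-based here)

loss_full = cross_entropy(Y, P)    # the full sum over k
loss_short = -math.log(P[y])       # only the y-th element of P matters
print(loss_full, loss_short)
```

Both numbers agree, which is why the loss can be treated as a function of P alone with y held constant.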
Now recall that P can be expressed as a function of input weights: P(W) = S(g(W)). So we have another function composition:

$$\mathit{xent}(W) = (\mathit{xent} \circ P)(W) = \mathit{xent}(P(W))$$

And we can, once again, use the multivariate chain rule to find the gradient of xent with respect to W:

$$D\mathit{xent}(W) = D(\mathit{xent} \circ P)(W) = D\mathit{xent}(P(W)) \cdot DP(W)$$

Let's check that the dimensions of the Jacobian matrices work out. DP(W) is TxNT and Dxent(P(W)) is 1xT, so the resulting Jacobian Dxent(W) is 1xNT, which makes sense because the whole network has one output (the cross-entropy loss, a scalar value) and NT inputs (the weights).
Here again, there's a straightforward way to find a simple formula for Dxent(W), since many elements in the matrix multiplication end up cancelling out.
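Note that xent(P) depends only on the y-th element of P. Therefore, only $D_y \mathit{xent}$ is non-zero in the Jacobian, and since $\mathit{xent}(P) = -\log(P_y)$:

$$D_y \mathit{xent} = -\frac{1}{P_y}$$

Going back to the full Jacobian Dxent(W), we multiply Dxent(P) by each column of DP(W) to get each element in the resulting row vector. Recall that the row vector represents the whole weight matrix W "linearized" in row-major order, so we index into it with i and j:

$$D_{ij} \mathit{xent}(W) = \sum_{k=1}^{T} D_k \mathit{xent}(P) \cdot D_{ij} P_k(W)$$

Only the y-th element of Dxent(P) is non-zero. Using $P_y = S_y$ and the layer Jacobian $D_{ij} P_y = S_y(\delta_{yi} - S_i) x_j$, where $\delta_{yi}$ is 1 when y = i and 0 otherwise:

$$D_{ij} \mathit{xent}(W) = -\frac{1}{S_y} \cdot S_y (\delta_{yi} - S_i) \, x_j = (S_i - \delta_{yi}) \, x_j$$

Once again, even though in this case the end result is nice and clean, it didn't necessarily have to be so.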
The technique of multiplying Jacobian matrices is oblivious to all this, as the computer can do all the sums for us. All we have to do is compute the individual Jacobians, which is usually easier because they are for simpler, non-composed functions.
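One way to gain confidence in a gradient like this is to let the computer differentiate the loss numerically and compare. The sketch below assumes the closed form that this derivation leads to, (S_i - delta_yi) x_j for the weight in row i and column j, and checks it against central finite differences. The function names, the step size `eps` and the data are all hypothetical.

```python
import math

def softmax(a):
    d = -max(a)
    exps = [math.exp(v + d) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def loss(W, x, y):
    """xent(W) = -log(P_y), where P = softmax(W x)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return -math.log(softmax(logits)[y])

def grad_closed_form(W, x, y):
    """Row-major gradient: entry i*N + j is (S_i - delta_yi) * x_j."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    S = softmax(logits)
    return [(S[i] - (1.0 if i == y else 0.0)) * xj
            for i in range(len(W)) for xj in x]

def grad_numeric(W, x, y, eps=1e-6):
    """Central finite differences over every weight, in row-major order."""
    grad = []
    for i in range(len(W)):
        for j in range(len(x)):
            W_plus = [row[:] for row in W]
            W_minus = [row[:] for row in W]
            W_plus[i][j] += eps
            W_minus[i][j] -= eps
            grad.append((loss(W_plus, x, y) - loss(W_minus, x, y)) / (2 * eps))
    return grad

W = [[0.1, -0.2, 0.3],
     [0.0, 0.5, -0.1]]     # T = 2, N = 3 (made-up values)
x = [1.0, 2.0, 3.0]
y = 1                      # 0-based index of the true class

analytic = grad_closed_form(W, x, y)
numeric = grad_numeric(W, x, y)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)  # small; limited by the finite-difference approximation
```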
This is the beauty and utility of the multivariate chain rule.

The previous intermezzo described how to do a classification of 2 classes with the help of the logistic function. For multiclass classification there exists an extension of this logistic function called the softmax function which is used in multinomial logistic regression.
The following section will explain the softmax function and how to derive it. This logistic function can be generalized to output a multiclass categorical probability distribution by the softmax function.
This function is a normalized exponential and is defined as:

$$S_j = \frac{e^{a_j}}{\sum_k e^{a_k}}$$

To use the softmax function in neural networks, we need to compute its derivative. The model parameters are then found by maximizing a likelihood, which can be written as a conditional distribution. As was noted during the derivation of the cost function of the logistic function, maximizing this likelihood can also be done by minimizing the negative log-likelihood.
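For reference, the derivative of the softmax mentioned above follows from the quotient rule. The sketch below writes $S_i$ for the i-th output and $a_j$ for the j-th input, which is notation assumed here rather than taken from this part of the text, and uses $\delta_{ij}$ for the Kronecker delta (1 when i = j, 0 otherwise).

$$
\begin{aligned}
\frac{\partial S_i}{\partial a_j}
  &= \frac{\partial}{\partial a_j} \left( \frac{e^{a_i}}{\sum_k e^{a_k}} \right)
   = \frac{\delta_{ij} \, e^{a_i} \sum_k e^{a_k} - e^{a_i} e^{a_j}}{\left( \sum_k e^{a_k} \right)^2} \\
  &= \frac{e^{a_i}}{\sum_k e^{a_k}} \left( \delta_{ij} - \frac{e^{a_j}}{\sum_k e^{a_k}} \right)
   = S_i (\delta_{ij} - S_j)
\end{aligned}
$$

So the diagonal entries are $S_i(1 - S_i)$ and the off-diagonal entries are $-S_i S_j$.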