Sigmoid Change

Step functions enforce a threshold on a variable, e.g. discarding anything below a certain value. Given the unit step $u(x)$, any input equal to or greater than $0$ produces a positive output ($1$ in this case); everything else yields zero. With $3u(x-8)$, every input equal to or greater than $8$ produces a non-zero output ($3$ in this case).
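
To make the two examples concrete, here is a minimal Python sketch. The function names `u` and `scaled_shifted_step` are just illustrative choices, and the convention $u(0) = 1$ follows the description above.

```python
def u(x):
    """Unit step: 1 for x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def scaled_shifted_step(x):
    """3u(x - 8): 3 for x >= 8, otherwise 0."""
    return 3 * u(x - 8)


# Probe a few inputs on either side of both thresholds.
for x in (-1, 0, 7.9, 8, 12):
    print(x, u(x), scaled_shifted_step(x))
```

Shifting the argument moves the threshold, while the factor in front sets the height of the step.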

Step functions are like switches. We could utilise them to trigger a signal whenever a threshold is crossed. Imagine opening the floodgates whenever the water level crosses a certain point or turning off the electric lighting when the ambient light brightness exceeds a given measure.

Perceptrons, binary classifiers and the neurons in artificial neural networks require an activation function in order to produce any output. The step function is a valid option for the activation function but poses a challenge in analysis because of the jump discontinuity at $x=0$.

At $x=0$ the derivative is undefined, because the function jumps by a finite amount over zero distance. Everywhere else in the domain the derivative is zero.
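
A quick numerical experiment illustrates this. The sketch below approximates the slope of the step function with a central difference (the helper name `central_difference` and the chosen step sizes are assumptions made for illustration): away from the jump the estimate is exactly zero, while at $x=0$ it keeps growing as the step size shrinks instead of settling on a value.

```python
def u(x):
    """Unit step: 1.0 for x >= 0, otherwise 0.0."""
    return 1.0 if x >= 0 else 0.0


def central_difference(f, x, h):
    """Approximate f'(x) as (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Compare the slope estimate away from the jump (x = 2) and at the jump (x = 0).
for h in (1e-1, 1e-3, 1e-6):
    print(h, central_difference(u, 2.0, h), central_difference(u, 0.0, h))
```

At $x=0$ the estimate equals $\frac{1}{2h}$, which grows without bound as $h$ approaches zero.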

The sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$, because of its differentiability :wink:, is sometimes used as an alternative to the step function $u(x)$.

Why care about differentiability? Well, when determining the coefficients of a neural network, methods such as backpropagation depend on the ability to determine the rate of change in the network output per change in each coefficient. Imagine that we assign coefficients at random initially. Through backpropagation, we work out for every coefficient how a change in it moves the output of the network towards the target value (given a certain set of input values). We apply the more favorable change and proceed to the next iteration, repeating the process until the output of the network is sufficiently close to the target value. This process involves as many partial derivatives as we have coefficients to determine.
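
As a rough illustration of that loop, the sketch below trains a single sigmoid neuron by plain gradient descent. Everything here is an assumption made for the example: the squared-error loss, the names `train`, `w`, `b` and `lr`, the fixed starting coefficients (instead of random ones) and the input and target values. It relies on the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which is derived later in this post.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(x, target, w=0.5, b=0.0, lr=1.0, steps=2000):
    """Fit one neuron y = sigmoid(w * x + b) to a single (x, target) pair."""
    for _ in range(steps):
        y = sigmoid(w * x + b)
        # Chain rule for the loss 0.5 * (y - target)^2:
        # dL/dz = (y - target) * sigma'(z), with sigma'(z) = y * (1 - y).
        grad_z = (y - target) * y * (1.0 - y)
        # One partial derivative per coefficient: dz/dw = x and dz/db = 1.
        w -= lr * grad_z * x
        b -= lr * grad_z
    return w, b


w, b = train(x=1.5, target=0.8)
print(sigmoid(w * 1.5 + b))  # output of the trained neuron, close to the target
```

With a step activation, the `y * (1.0 - y)` factor would be replaced by a derivative that is zero almost everywhere, and the coefficients would never move.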

Some may choose to introduce a weight $\beta$ to the input variables in order to obtain a sigmoid curve more reminiscent of the step we’re trying to mimic. By tweaking the $\beta$ term, one can obtain a result that approaches the step function, yet remains differentiable throughout its entire domain :smile:. The graph below demonstrates the output of a sigmoid function with a $\beta$ of $1$, $2$, $10$ and $100$, where a plot of $\beta = 100$ approximates the step function quite closely in comparison to the other plots.
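
The effect of $\beta$ can also be seen numerically. The sketch below assumes the weighted sigmoid takes the form $\frac{1}{1+e^{-\beta x}}$; the function name `weighted_sigmoid` and the sample inputs are illustrative choices.

```python
import math


def weighted_sigmoid(x, beta=1.0):
    """Sigmoid with the input scaled by beta: 1 / (1 + e^(-beta * x))."""
    return 1.0 / (1.0 + math.exp(-beta * x))


# Evaluate the same inputs around zero for increasingly large beta.
inputs = (-0.5, -0.1, 0.0, 0.1, 0.5)
for beta in (1, 2, 10, 100):
    print(beta, [round(weighted_sigmoid(x, beta), 4) for x in inputs])
```

For larger $\beta$ the outputs on either side of zero move towards $0$ and $1$, while the value at $x = 0$ stays at $0.5$.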

However enticing that may be, in this post we’ll keep our eyes on the unweighted sigmoid.

Derivation of the Sigmoid

Let’s find the derivative of the unweighted sigmoid.

Let’s stick to the Lagrange notation for a bit, where we write the derivative of $f(x)$ as $f'(x)$. Given $f(x) = \frac{a(x)}{b(x)}$, the quotient rule states that:

$$f'(x) = \frac{a'(x)\,b(x) - a(x)\,b'(x)}{b(x)^2}$$

Now that we know how to work out the derivative of a quotient, we can work out $a'(x)$ and $b'(x)$ in order to fill in the blanks later. For the sigmoid, $a(x) = 1$ and $b(x) = 1 + e^{-x}$, so:

$$a'(x) = 0, \qquad b'(x) = -e^{-x}$$

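Working out the math for the derivative of $\frac{1}{1+e^{-x}}$ leads to the following result, given the quotient rule:

$$\sigma'(x) = \frac{0 \cdot (1+e^{-x}) - 1 \cdot (-e^{-x})}{(1+e^{-x})^2} = \frac{e^{-x}}{(1+e^{-x})^2}$$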

It’s pretty clear that $e^{-x}$ is equal to $(1+e^{-x})-1$. That part is trivial to understand; however, it takes real insight to arrange the terms this way so that some of them can be eliminated in the next step through this substitution:

$$\sigma'(x) = \frac{(1+e^{-x}) - 1}{(1+e^{-x})^2}$$

Go the extra mile to separate the expression into two separate terms:

$$\sigma'(x) = \frac{1+e^{-x}}{(1+e^{-x})^2} - \frac{1}{(1+e^{-x})^2}$$

We can simplify the first term since $(1+e^{-x})$ occurs in both the numerator and the denominator. The second term can be simplified by writing its numerator as a square, which is simply $1 = 1^2$. This paves the way to squaring the entire term, since $\frac{a^n}{b^n} = \left(\frac{a}{b}\right)^n$:

$$\sigma'(x) = \frac{1}{1+e^{-x}} - \left(\frac{1}{1+e^{-x}}\right)^2$$

Since $\sigma(x)$ equals $\frac{1}{1+e^{-x}}$ we can simplify our result to:

$$\sigma'(x) = \sigma(x) - \sigma(x)^2 = \sigma(x)\,(1 - \sigma(x))$$

which gives us the beautiful derivative of a sigmoid.

They don’t make it any simpler than this :wink:.
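
As a sanity check, the closed form $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ can be compared against a numerical estimate of the slope. This is a minimal sketch; the helper `central_difference`, its step size and the sample points are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Closed form: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def central_difference(f, x, h=1e-5):
    """Approximate f'(x) as (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)


# The two columns should agree to many decimal places.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, sigmoid_derivative(x), central_difference(sigmoid, x))
```

The derivative peaks at $x = 0$, where $\sigma(0) = 0.5$ gives $\sigma'(0) = 0.25$, and it needs only the value of $\sigma(x)$ that was already computed.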