Primers • Activation Functions
- Overview
- Sigmoid Function
- Hyperbolic Tangent (tanh):
- Rectified Linear Unit (ReLU):
- Leaky ReLU:
- Softmax:
- Further Reading
Overview
- Activation functions essentially decide whether a neuron should be ‘activated’ or not.
- They are applied to the weighted sum computed by each neuron in a neural network and determine how strongly that neuron's signal is passed on to the next layer.
- In other words, the activation function determines the output of a neuron given its input.
- It is used to introduce non-linearity into the network, which is essential for modeling complex, non-linear relationships between inputs and outputs (see the sketch below).
Image source: TheAiEdge.io
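- To make the overview concrete, here is a minimal sketch of a single neuron that applies a non-linear activation (sigmoid here) to its weighted input. The inputs, weights, and bias are made-up values, purely for illustration.

```python
import numpy as np

def sigmoid(z):
    # Classic sigmoid: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative single neuron: weighted sum of inputs plus a bias,
# followed by a non-linear activation.
x = np.array([0.5, -1.2, 3.0])   # example inputs (arbitrary values)
w = np.array([0.4, 0.2, -0.1])   # example weights (arbitrary values)
b = 0.1                          # example bias

z = np.dot(w, x) + b             # pre-activation (purely linear)
a = sigmoid(z)                   # activation introduces the non-linearity

print(f"pre-activation z = {z:.3f}, activated output a = {a:.3f}")
```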
Sigmoid Function
- The sigmoid activation function maps any input value to an output value between 0 and 1.
- The sigmoid activation function is often used for binary classification problems, where the two classes are clearly separated and the target output is 0 or 1; its output can be read as the probability of the positive class (see the sketch after this list).
- “Some drawbacks of this activation that have been noted in the literature are: sharp damp gradients during backpropagation from deeper hidden layers to inputs, gradient saturation, and slow convergence.” Source
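- Below is a minimal sketch of the sigmoid and its derivative (assuming NumPy; the sample z values are arbitrary). The near-zero gradients at large |z| illustrate the saturation and slow-convergence issues quoted above.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative: sigma(z) * (1 - sigma(z)), which saturates for large |z|.
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"z = {z:6.1f}   sigmoid = {sigmoid(z):.4f}   gradient = {sigmoid_grad(z):.6f}")
```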
Hyperbolic Tangent (tanh):
- The tanh activation function maps any input value to a value between -1 and 1 and is thus often used for modeling continuous outputs in the range [-1, 1] (see the sketch after this list).
- For example, it is commonly used in recurrent neural networks (RNNs) and long short-term memory (LSTM) networks to model sequential data.
- “Historically, the tanh function became preferred over the sigmoid function as it gave better performance for multi-layer neural networks.
- But it did not solve the vanishing gradient problem that sigmoids suffered, which was tackled more effectively with the introduction of ReLU activations.” Source
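- The following sketch (assuming NumPy, with arbitrary sample inputs) shows that tanh outputs are zero-centered in (-1, 1), but its derivative still shrinks toward zero for large |z|, so it does not cure the vanishing gradient problem.

```python
import numpy as np

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])   # arbitrary sample inputs

print(np.tanh(z))             # outputs lie in (-1, 1) and are zero-centered
print(1.0 - np.tanh(z) ** 2)  # derivative 1 - tanh(z)^2 still saturates for large |z|
```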
Rectified Linear Unit (ReLU):
- The ReLU activation function produces outputs in the range [0, +inf): it maps any negative input to 0 and leaves positive inputs unchanged, i.e., ReLU(x) = max(0, x).
- These activation functions are often used for hidden layers in feedforward neural networks and are well-suited for problems with non-linear relationships in the positive part of the input space.
- ReLU is linear in the positive domain and zero in the negative domain.
- ReLU activations mitigate the vanishing gradient problem that sigmoids suffer from, since the gradient is a constant 1 for positive inputs (see the sketch below).
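- A minimal sketch of ReLU and its gradient (assuming NumPy, with arbitrary sample inputs): negative inputs are zeroed out, while positive inputs pass through with a gradient of 1, so the gradient does not saturate on the positive side.

```python
import numpy as np

def relu(z):
    # max(0, z): zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, z)

def relu_grad(z):
    # Gradient is 1 for positive inputs and 0 otherwise.
    return (z > 0).astype(float)

z = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])   # arbitrary sample inputs
print(relu(z))
print(relu_grad(z))
```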
Leaky ReLU:
- The leaky ReLU is a variant of the ReLU activation function.
- The leaky ReLU allows for small, non-zero values for negative inputs.
- It is defined as max(αx, x), where x is the input and α is a small positive constant (see the sketch after this list).
- “Leaky Rectified Linear Unit, or Leaky ReLU, is a type of activation function based on a ReLU, but it has a small slope for negative values instead of a flat slope. The slope coefficient is determined before training, i.e. it is not learnt during training.
- This type of activation function is popular in tasks where we may suffer from sparse gradients, for example training generative adversarial networks.” Source
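- A minimal sketch of the leaky ReLU (assuming NumPy; the slope α = 0.01 and the sample inputs are arbitrary choices): negative inputs are scaled by α instead of being zeroed out, so a small gradient still flows for negative values.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # max(alpha * z, z): small fixed slope alpha for negative inputs.
    # alpha is a hyperparameter chosen before training, not learned.
    return np.where(z > 0, z, alpha * z)

z = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])   # arbitrary sample inputs
print(leaky_relu(z))
```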
Softmax:
- Here is where some confusion can arise, because there is both a Softmax loss and a Softmax activation function.
- Softmax loss, or cross-entropy loss as discussed in the page here, is a commonly used loss function for multi-class classification problems.
- The Softmax loss is a combination of the Softmax activation function and the cross-entropy (logistic) loss.
- It measures the difference between the predicted probability distribution and the true label distribution.
- Softmax activation function is differentiable, which makes it possible to train neural networks using gradient-based optimization algorithms.
- The Softmax activation function maps the raw outputs (logits) of a network to a probability distribution over multiple classes (see the sketch below).
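- A minimal sketch of the Softmax activation and the associated cross-entropy (Softmax) loss, assuming NumPy; the logits and the true class index are made up for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability; the result sums to 1.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def cross_entropy(probs, true_class):
    # Softmax (cross-entropy) loss for a single example with an integer label.
    return -np.log(probs[true_class])

logits = np.array([2.0, 1.0, 0.1])   # made-up scores for a 3-class problem
probs = softmax(logits)

print(probs, probs.sum())                  # a valid probability distribution
print(cross_entropy(probs, true_class=0))  # loss when class 0 is the true label
```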