Overview

  • Optimizers are algorithms used to optimize the performance of a model by adjusting the model parameters during training.
  • During the training process, a model is presented with a set of training data and a corresponding set of labels or targets.
  • The model then makes predictions on the data, and the predictions are compared to the labels to calculate an error or loss function that measures how well the model is performing.
  • The goal of optimization is to minimize this loss function by adjusting the model parameters, such as the weights and biases of a neural network.
  • Optimizers are used to adjust the model parameters in an iterative process during training.
  • At each iteration, the optimizer computes the gradients of the loss function with respect to the model parameters and then updates the parameters to move in the direction of minimizing the loss.
  • The learning rate, which determines the step size of the update, is a hyperparameter that is set by the user.
  • Optimizers are an essential component of the training process in ML/DL and are used to adjust the model parameters to minimize the loss function and improve the performance of the model.
  • Below we will talk about some of the most common optimizers:

Adam

  • Adam combines ideas from both RMSprop and momentum to achieve efficient and robust optimization.
  • The key idea behind Adam is to adapt the learning rate for each parameter based on estimates of the first and second moments of the gradients.
def adam(cost_func, grad_func, x_init, learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-8, max_iter=1000, tol=1e-6):
    x = x_init.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for i in range(max_iter):
        g = grad_func(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**(i+1))
        v_hat = v / (1 - beta2**(i+1))
        x_prev = x.copy()
        x -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
        if np.linalg.norm(x - x_prev) < tol:
            break
    return x

  • In this implementation:
    • cost_func is the objective function to be minimized,
    • grad_func is its gradient, x_init is the initial parameter vector,
    • learning_rate is the step size,
    • beta1 and beta2 are the decay rates,
    • epsilon is a small constant added for numerical stability,
    • max_iter is the maximum number of iterations that Adam should perform before stopping

SGD

  • Stochastic Gradient Descent (SGD) is particularly useful when dealing with large datasets.
  • This is because it allows for efficient training of models by updating the model parameters based on a small subset of the training data (a mini-batch) at a time.
  • The limitations of SGD include:
    • sensitivity to the learning rate and the mini-batch size
    • a tendency to get stuck in local minima
  • The optimizers below help to solve these limitations.

The basic idea behind SGD is to iteratively update the model parameters in the direction of the negative gradient of the loss function with respect to the parameters.

  • \[\theta_{t+1} = \theta_{t} - \alpha \nabla L(\theta_{t}, x_{i}, y_{i})\]
  • where:
  • \(\theta_t\) is the parameter vector at iteration t,
  • \(\alpha\) is the learning rate, L is the loss function,
  • and (x_i, y_i) is a mini-batch of training data.
  • The symbol \(\nabla\) denotes the gradient operator.
# Vanilla Gradient Descent

for epoch in range(num_epochs):
    predictions = model(training_data)
    loss = loss_function(predictions, ground_truth)
    weights_grad = evaluate_gradient(loss) # using torch by calling `loss.backward()`
    weights -= learning_rate * weights_grad # perform parameter update

SGD with Momentum

  • Stochastic Gradient Descent with Momentum is an extension of the standard SGD algorithm that incorporates a momentum term to help accelerate convergence and reduce oscillations during training.
  • The momentum term acts as a moving average of the gradients, allowing the algorithm to continue moving in the same direction, even if the current gradient is small.
  • \[v_t = \gamma v_{t-1} + \alpha \nabla_{\theta} J(\theta)\]
  • \[\theta = \theta - v_t\]
  • where:
  • \(\theta\) is the parameter vector,
  • \(J(\theta)\) is the cost function,
  • \(\alpha\) is the learning rate,
  • and \(\gamma\) is the momentum coefficient.
def sgd_momentum(cost_func, grad_func, x_init, learning_rate, momentum_coef, max_iter=1000, tol=1e-6):
    x = x_init.copy()
    v = np.zeros_like(x)
    for i in range(max_iter):
        grad = grad_func(x)
        v = momentum_coef * v + learning_rate * grad
        x_prev = x.copy()
        x -= v
        if np.linalg.norm(x - x_prev) < tol:
            break
    return x

-In this implementation: - cost_func is the objective function to be minimized, - grad_func is its gradient, x_init is the initial parameter vector, - learning_rate is the step size, momentum_coef is the momentum coefficient, - max_iter is the maximum number of iterations, - and tol is the tolerance for convergence.

RMSprop

  • MSprop is an optimization algorithm that is commonly used to optimize deep neural networks.
  • It is a variation of stochastic gradient descent that adapts the learning rate based on the magnitude of recent gradients.
  • The key idea behind RMSprop is to divide the learning rate by a running average of the magnitudes of recent gradients.
def rmsprop(cost_func, grad_func, x_init, learning_rate, decay_rate=0.9, epsilon=1e-8, max_iter=1000, tol=1e-6):
    x = x_init.copy()
    E_g_squared = np.zeros_like(x)
    for i in range(max_iter):
        g = grad_func(x)
        E_g_squared = decay_rate * E_g_squared + (1 - decay_rate) * g**2
        x_prev = x.copy()
        x -= learning_rate * g / (np.sqrt(E_g_squared) + epsilon)
        if np.linalg.norm(x - x_prev) < tol:
            break
    return x

  • In this implementation:
    • cost_func is the objective function to be minimized,
    • grad_func is its gradient,
    • x_init is the initial parameter vector,
    • learning_rate is the step size,
    • decay_rate is the decay rate,
    • epsilon is a small constant added for numerical stability,
    • max_iter is the maximum number of iterations,
    • and tol is the tolerance for convergence.