• This article covers many fundamental Deep Learning concepts.
  • Feel free to use this as a reference as you advance in the field, or use it for interview prep!
  • And please let me know if you’d like to see something else on here; my LinkedIn and email are provided at vinija.ai!

Classification Algorithms

Logistic Regression

  • To start off, the name Logistic Regression is something of a misnomer: the model is used for classification problems, not regression problems.
  • Logistic regression estimates the probability of an event occurring based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1.
  • You can use your returned value in one of two ways:
    • You may just need the probability itself, a value between 0 and 1, and use it “as is”.
      • Say your model predicts the probability that your baby will cry at night as:
        • \[p(cry|night)= 0.05\]
      • Let’s leverage this information to see how many nights a year you’ll have to wake up to soothe your baby:
      • \[wakeUp = p(cry|night)*nights\]
      • \[= 0.05 * 365\]
      • \[= 18.25 \approx 18 \text{ nights}\]
    • Or you may want to convert it into a binary category, such as spam or not spam, and treat it as a binary classification problem.
      • In this scenario, you’d want a prediction of exactly 0 or 1, with no probabilities in between.
      • The model first applies the sigmoid function, which guarantees a value between 0 and 1; that value is then compared against a classification threshold (0.5 is a common choice) to produce the 0 or 1 label.
      • The sigmoid function is represented as:
      • \[y^{\prime}=\frac{1}{1+e^{-z}}\]
      • where,
        • \(y^{\prime}\) is the output of the logistic regression model for a particular example.
        • \[z=b+w_{1} x_{1}+w_{2} x_{2}+\ldots+w_{N} x_{N}\]
        • The \(w\) values are the model’s learned weights, and \(b\) is the bias.
        • The \(x\) values are the feature values for a particular example.
  • Pros:
    • Algorithm is quite simple and efficient
    • Provides concrete probability scores as output
  • Cons:
    • Bad at handling a large number of categorical features.
    • It assumes that the data is free of missing values and predictors are independent of each other.
  • Use case:
    • Logistic Regression is used when the dependent variable (target) is categorical.
      • For example, to predict whether an email is spam (1) or not (0), or whether a tumor is malignant (1) or not (0).
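  • A minimal sketch of the sigmoid formula and the thresholding step described above. The weights, bias, feature values, and the 0.5 threshold are hypothetical values chosen for illustration, not learned from data:

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(weights, bias, features):
    """Compute y' = sigmoid(b + w1*x1 + ... + wN*xN) for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_class(probability, threshold=0.5):
    """Turn a probability into a 0/1 label (the threshold is an assumption)."""
    return 1 if probability >= threshold else 0

# Hypothetical model parameters and a single example, for illustration only.
weights = [0.8, -0.4]
bias = -0.2
features = [1.5, 2.0]

probability = predict_probability(weights, bias, features)  # use "as is"
label = predict_class(probability)                          # or as a 0/1 class
print(probability, label)
```

  • Here \(z = -0.2 + 0.8 \cdot 1.5 - 0.4 \cdot 2.0 = 0.2\), so the probability is slightly above 0.5 and the example is assigned class 1. In practice the weights and bias would be learned from training data.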

Naive Bayes Classifier

  • Naive Bayes is a binary or multi-class classifier.
  • Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem.
  • It is termed “Naive” because it has the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
  • Bayes’ theorem, written in the notation used below, is:
  • \[P(C_i|O)=\frac{P(O|C_i)P(C_i)}{P(O)}\]
  • The thought behind naive Bayes classification is to try to classify the data by maximizing the numerator, since \(P(O)\) is the same for every class:
  • \[P(O|C_i)P(C_i)\]
    • where,
      • \(O\) is the Object or tuple in a dataset
      • \(C_i\) is the class with index \(i\)
  • Pros:
    • This algorithm works very fast.
    • It can also be used to solve multi-class prediction problems.
    • This classifier performs better than other models with less training data if the assumption of independence of features holds.
  • Cons:
    • It assumes that all the features are independent. This is actually a big con because features in reality are frequently not fully independent.
  • Use case:
    • When the assumption of independence holds between features.
      • The Naive Bayes classifier will perform better than logistic regression if that holds true, and it will require less training data.
    • It performs better with categorical input variables than with numerical ones.
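  • A minimal sketch of classifying by maximizing \(P(O|C_i)P(C_i)\) for categorical features. The tiny spam dataset, the feature meanings, and the add-one (Laplace) smoothing are assumptions of this sketch, not part of the description above:

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, feature index) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
            feature_values[j].add(value)
    return class_counts, value_counts, feature_values

def predict(model, row):
    """Return the class C_i that maximizes P(O|C_i) * P(C_i)."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        score = count / total  # prior P(C_i)
        for j, value in enumerate(row):
            # Naive assumption: features are conditionally independent given
            # the class, so P(O|C_i) is a product of per-feature likelihoods.
            # Add-one smoothing (an assumption here) avoids zero probabilities.
            numerator = value_counts[(c, j)][value] + 1
            denominator = count + len(feature_values[j])
            score *= numerator / denominator
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical data: (contains the word "free", contains a link) -> label.
rows = [("yes", "yes"), ("yes", "no"), ("no", "yes"),
        ("no", "no"), ("no", "no"), ("yes", "no")]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("yes", "yes")))
print(predict(model, ("no", "no")))
```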

Regression Algorithms

Linear Regression

  • Linear regression analysis is used to predict the value of a variable based on the value of another variable.
  • The variable you want to predict is called the dependent variable.
  • The variable you are using to predict the other variable’s value is called the independent variable. (Ref: IBM)
  • It assumes a linear relationship between the two variables and fits a linear equation to the data.
  • The goal of Linear Regression is to predict output values for inputs that are not present in the data set, with the belief that those outputs would fall on the fitted line.
  • Pros:
    • Performs very well when the relationship between the variables is linear.
    • Easy to implement and is interpretable.
  • Cons:
    • Prone to noise and overfitting.
    • Very sensitive to outliers.
  • Use case:
    • Linear regression is commonly used for predictive analysis and modeling.
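  • A minimal sketch of fitting a line to one independent variable and predicting an unseen input. Ordinary least squares is assumed as the fitting method (the text above does not name one), and the data points are made up:

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Predict the dependent variable for an input not in the data set."""
    return slope * x + intercept

# Hypothetical data that lies exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(slope, intercept, 5))
```

  • Because these points lie exactly on a line, the fit recovers a slope of 2 and an intercept of 1. With noisy data the line would only approximate the points, and a single outlier could shift it noticeably.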

Either Classification or Regression

K-Nearest Neighbor

  • “Birds of a feather flock together”
  • KNN is a supervised machine learning algorithm that assumes similar things exist in close proximity.
  • “K” here stands for a number of your choosing; it represents the number of neighbors you’d like to consider when producing an output.
  • KNN answers the question: given the current data, what are the K most similar data points to the query?
    • KNN will calculate distance most commonly using either Euclidean or Manhattan distance:
    • Euclidean:
      • \[d(x, y)=\sqrt{\sum_{i=1}^{n}\left(y_{i}-x_{i}\right)^{2}}\]
    • Manhattan:
      • \[d(x, y)=\sum_{i=1}^{n}\left|x_{i}-y_{i}\right|\]
  • “The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point.
  • While it can be used for either regression or classification problems, it is typically used as a classification algorithm, working off the assumption that similar points can be found near one another.” Ref: IBM
  • This is the high-level view of how the algorithm works:
    • For each example in the data:
      • Calculate the distance between the query example and the current example from the data
      • Add the distance and the index of the example to an ordered collection
    • Sort the collection in ascending order by distance
    • Pick the first K entries from the sorted collection
    • Get the labels of the selected K entries
    • If regression, return the mean of the K labels
    • If classification, return the mode of the K labels
  • Pros:
    • Easy to implement
    • Needs only a few hyperparameters which are:
      • The value of K
      • Distance metric used
  • Cons:
    • Does not scale well as it takes too much memory and data storage compared with other classifiers
    • Prone to overfitting if the value of K is too low and will underfit if the value of K is too high.
  • Use case:
    • When labelled data is too expensive or impossible to obtain.
    • When the dataset is relatively small and noise-free.
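  • The high-level steps above can be sketched as follows. The function names, the `task` parameter, and the sample points are hypothetical and chosen for illustration:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    """Manhattan distance between two equal-length points."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def knn_predict(data, labels, query, k, distance=euclidean, task="classification"):
    # For each example: compute its distance to the query and keep its index.
    distances = [(distance(example, query), i) for i, example in enumerate(data)]
    # Sort in ascending order by distance and pick the first K entries.
    distances.sort()
    k_labels = [labels[i] for _, i in distances[:k]]
    if task == "regression":
        return sum(k_labels) / k                    # mean of the K labels
    return Counter(k_labels).most_common(1)[0][0]   # mode of the K labels

# Hypothetical 2-D points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
classes = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_predict(points, classes, (2, 2), k=3))
print(knn_predict(points, values, (1.5, 1.5), k=3, task="regression"))
```

  • Note that every prediction scans the whole data set, which is why KNN scales poorly as the data grows.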

Support Vector Machine

  • The objective of SVM is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points.
    • The hyperplane is the decision boundary that helps classify the data points.
  • If the number of input features is 2, the hyperplane is just a line; if the number of input features is 3, it becomes a 2D plane.
  • When the data is not linearly separable in the original feature space, we can use the kernel trick. The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts a non-separable problem into a separable one. It is mostly useful in non-linear separation problems.
  • The image below displays the linear hyperplane separating the two classes, chosen so that its distance to the nearest data point on each side is maximized. This hyperplane is known as the maximum-margin hyperplane (hard margin).
  • Pros:
    • Works well when there is a clear margin of separation between classes.
    • It’s easy to customize the kernel function depending on your data’s dimensionality.
    • It’s memory efficient, as the decision function uses only a subset of the training points, called support vectors.
  • Cons:
    • It doesn’t perform well on large data sets because the required training time is high.
    • It also doesn’t perform very well when the data set is noisy, i.e. the target classes overlap.
  • Use case:
    • While SVMs can be used for regression problems, they are best used for classification.
    • Works great in high-dimensional spaces, i.e. when the number of features is high.
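  • A small sketch of the idea behind the kernel trick, assuming a hypothetical 1-D data set and a hand-picked feature map \(\phi(x) = (x, x^2)\). The hyperplane below is chosen by hand midway between the closest points of each class; it is not learned by an SVM solver:

```python
def feature_map(x):
    """Lift a 1-D input into 2-D: phi(x) = (x, x^2)."""
    return (x, x * x)

def kernel(x, y):
    """Kernel equal to the dot product phi(x) . phi(y), computed without lifting."""
    return x * y + (x * y) ** 2

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical 1-D data: class +1 sits on both sides of class -1,
# so no single threshold on x separates the two classes.
points = [-2.0, -1.5, -0.5, 0.5, 1.5, 2.0]
labels = [1, 1, -1, -1, 1, 1]

# In the lifted space, the hyperplane w . phi(x) + b = 0 separates them.
# w and b are hand-picked for this example.
w, b = (0.0, 1.0), -1.25
predictions = [1 if dot(w, feature_map(x)) + b > 0 else -1 for x in points]
print(predictions == labels)

# The kernel gives the same similarity as the explicit mapping.
print(kernel(1.5, -2.0), dot(feature_map(1.5), feature_map(-2.0)))
```

  • The point of the kernel is the last line: the similarity in the higher-dimensional space can be computed directly from the original inputs, so the mapped features never need to be built explicitly.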

Decision Tree

  • Like the name suggests, a Decision Tree is a tree with a flowchart-like structure consisting of 3 elements:
    • The internal node denotes a test on an attribute.
    • Each branch represents an outcome of the test.
    • Each leaf node (terminal node) holds a class label.
  • The objective of a Decision Tree is to create a model that can predict the class of the target variable by learning simple decision rules inferred from prior data (training data).
  • Pros:
    • Interpretability is high due to the intuitive nature of a tree.
  • Cons:
    • Small changes in data can lead to large structural changes on the tree.
  • Use case:
    • When you want to be able to lay out all the possible outcomes of a problem and work on challenging each option.
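  • A minimal sketch of the three elements (internal nodes, branches, leaves) and of how a prediction walks the tree. The tree is written by hand rather than learned from training data, and the attribute names and labels are hypothetical:

```python
# Internal nodes test an attribute, branches are the outcomes of that test,
# and leaf nodes hold a class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {
                "yes": {"label": "stay in"},
                "no": {"label": "play"},
            },
        },
        "rainy": {"label": "stay in"},
    },
}

def predict(node, example):
    """Follow the branch matching each test until a leaf is reached."""
    while "label" not in node:
        outcome = example[node["attribute"]]
        node = node["branches"][outcome]
    return node["label"]

print(predict(tree, {"outlook": "sunny", "windy": "no"}))
print(predict(tree, {"outlook": "rainy", "windy": "no"}))
```

  • Each root-to-leaf path reads as a plain decision rule, which is where the interpretability comes from.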


XGBoost

  • XGBoost is a gradient boosting algorithm that is highly efficient and scalable.
  • Here’s a high-level overview of the XGBoost algorithm:
    1. Initialize the model with a weak learner, usually a decision tree stump (a decision tree with a single split)
    2. Compute the negative gradient of the loss function with respect to the current prediction
    3. Fit a decision tree to the negative gradient to make a new prediction
    4. Add the prediction from this tree to the current prediction
    5. Repeat steps 2-4 for a specified number of trees, or until a stopping criterion is met
    6. Combine the predictions from all trees to get the final prediction
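  • To make step 2 above concrete, here is a small sketch of the negative gradient for two common loss functions, checked numerically. The choice of losses and all function names are assumptions for illustration; this is not the XGBoost library’s API:

```python
import math

def squared_error_loss(y, prediction):
    return 0.5 * (y - prediction) ** 2

def squared_error_negative_gradient(y, prediction):
    """-d/dF [0.5 * (y - F)^2] = y - F, i.e. the residual."""
    return y - prediction

def log_loss(y, raw_score):
    """Log loss for y in {0, 1}, with the raw score passed through a sigmoid."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def log_loss_negative_gradient(y, raw_score):
    """-d/dF of the log loss above: y - sigmoid(F)."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return y - p

def numeric_negative_gradient(loss, y, prediction, eps=1e-6):
    """Central finite difference, used only to sanity-check the formulas."""
    return -(loss(y, prediction + eps) - loss(y, prediction - eps)) / (2 * eps)

print(squared_error_negative_gradient(3.0, 2.5),
      numeric_negative_gradient(squared_error_loss, 3.0, 2.5))
print(log_loss_negative_gradient(1, 0.3),
      numeric_negative_gradient(log_loss, 1, 0.3))
```

  • For squared error the negative gradient is simply the residual, which is why fitting a tree to the negative gradient is often described as fitting a tree to the errors of the current prediction.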

Gradient Boosting

  • Gradient Boosting is an ensemble machine learning algorithm that combines multiple weak models to create a strong model.
  • It is an iterative process where, in each iteration, a new model is fit to the residual errors made by the previous model, with the goal of decreasing the overall prediction error.
  • The algorithm works as follows:
    1. Initialize the model with a weak learner, typically a decision tree with a single split.
    2. Compute the negative gradient of the loss function with respect to the current prediction.
    3. Fit a new model to the negative gradient.
    4. Update the prediction by adding the prediction from the new model.
    5. Repeat steps 2-4 for a specified number of iterations, or until a stopping criterion is met.
    6. Combine the predictions from all models to get the final prediction.
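  • The steps above can be sketched for regression on a single feature. This sketch assumes squared-error loss (so the negative gradient is the residual), uses one-split trees as the weak learners, and adds a `learning_rate` parameter that is not part of the description above; the data is made up:

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree on 1-D inputs by minimizing squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    sorted_xs = sorted(set(xs))
    for lo, hi in zip(sorted_xs, sorted_xs[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    # Step 1: initialize with a weak learner (a single-split tree).
    models = [(1.0, fit_stump(xs, ys))]
    predictions = [models[0][1](x) for x in xs]
    for _ in range(n_rounds):  # Step 5: repeat for a fixed number of rounds.
        # Step 2: negative gradient of squared error = residuals.
        residuals = [y - p for y, p in zip(ys, predictions)]
        # Step 3: fit a new weak learner to the negative gradient.
        stump = fit_stump(xs, residuals)
        # Step 4: add its (scaled) prediction to the current prediction.
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
        models.append((learning_rate, stump))
    # Step 6: the final prediction combines all weak learners.
    return lambda x: sum(weight * model(x) for weight, model in models)

def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training data.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.1, 2.9, 5.0, 5.2]

single_stump = fit_stump(xs, ys)
boosted = gradient_boost(xs, ys)
print(mean_squared_error(single_stump, xs, ys), mean_squared_error(boosted, xs, ys))
```

  • Comparing the two errors shows the effect of boosting: each added stump corrects part of what the previous ones got wrong, so the combined model fits the training data more closely than any single stump.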