ML Algorithms
 Overview
 Classification Algorithms
 Regression Algorithms
 Either Classification or Regression
 References
Overview
 This article covers the fundamentals of a number of common machine learning algorithms.
 Feel free to use this as a reference as you advance in the field, or use it for interview prep!

And please let me know if you’d like to see something else on here; my LinkedIn and email are provided at vinija.ai!
Classification Algorithms
Logistic Regression
 To start off, the name Logistic Regression is something of a misnomer: the model is used for classification problems, not regression problems.
 Logistic regression estimates the probability of an event occurring based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1.
 You can use your returned value in one of two ways:
 You may just need the probability itself (a value between 0 and 1) and use it “as is”.
 Say your model predicts the probability that your baby will cry at night as:
 \[p(\text{cryNight}) = 0.05\]
 Let’s leverage this information to see how many nights a year you’ll have to wake up to soothe your baby:
 \[\text{wakeUp} = p(\text{cryNight}) \cdot \text{nights}\]
 \[= 0.05 \cdot 365\]
 \[= 18.25 \approx 18 \text{ nights}\]
 Or you may want to convert it into a binary category such as: spam or not spam and convert it to a binary classification problem.
 In this scenario, you’d want an accurate prediction of 0 or 1 and no probabilities in between.
 A common way to obtain this is to leverage the sigmoid function, which guarantees a value between 0 and 1, and then compare that value against a classification threshold (for example 0.5) to get a 0 or 1.
 The sigmoid function is represented as:
 \[y^{\prime}=\frac{1}{1+e^{-z}}\]
 where,
 \(y^{\prime}\) is the output of the logistic regression model for a particular example.
 \[z=b+w_{1} x_{1}+w_{2} x_{2}+\ldots+w_{N} x_{N}\]
 The \(w\) values are the model’s learned weights, and \(b\) is the bias.
 The \(x\) values are the feature values for a particular example.
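 The sigmoid and the linear score \(z\) can be sketched in a few lines of Python. This is a minimal illustration, not a trained model: the weights, bias, feature values, and the 0.5 threshold are all assumptions chosen for the example.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Compute z = b + w1*x1 + ... + wN*xN, then squash it with the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_label(weights, bias, features, threshold=0.5):
    """Convert the probability into a 0/1 label (the threshold is an assumed choice)."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical weights and a single example, for illustration only.
weights = [0.8, -0.4]
bias = -0.2
example = [1.0, 2.0]

probability = predict_proba(weights, bias, example)  # used "as is"
label = predict_label(weights, bias, example)        # converted to a binary category
```

 The same probability serves both uses described above: it can be read directly, or thresholded when a hard 0/1 decision is needed.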
 Pros:
 Algorithm is quite simple and efficient
 Provides concrete probability scores as output
 Cons:
 Bad at handling a large number of categorical features.
 It assumes that the data is free of missing values and predictors are independent of each other.
 Use case:
 Logistic Regression is used when the dependent variable(target) is categorical.
 For example, to predict whether an email is spam (1) or not (0), or whether a tumor is malignant (1) or not (0).
Naive Bayes Classifier
 Naive Bayes is a binary or multiclass classifier.
 Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem.
 It is termed “Naive” because it has the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
 Bayes’ theorem, written for a class \(C_{i}\) and an object \(O\), is:
 \[P(C_{i} \mid O)=\frac{P(O \mid C_{i})\, P(C_{i})}{P(O)}\]
 The thought behind naive Bayes classification is to try to classify the data by maximizing:

\[P(O \mid C_{i})\, P(C_{i})\]
 where,
 \(O\) is the Object or tuple in a dataset
 \(i\) is an index of the class
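 As a rough sketch of that maximization, the snippet below estimates the prior and the per-feature likelihoods from counts on a tiny made-up dataset and returns the class with the highest product. The data, the function names, and the choice to skip smoothing are assumptions for illustration; practical implementations usually add smoothing so that unseen values do not force a probability of zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[c][j][v] = number of class-c rows whose j-th feature equals v
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            feature_counts[label][j][value] += 1
    return class_counts, feature_counts

def classify(row, class_counts, feature_counts):
    """Return the class C_i that maximizes P(O | C_i) * P(C_i)."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        score = count / total  # prior P(C_i)
        for j, value in enumerate(row):
            # Naive assumption: features are independent given the class,
            # so P(O | C_i) is a product of per-feature probabilities.
            score *= feature_counts[c][j][value] / count
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Made-up toy data: (outlook, temperature) -> play outside?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "cool")]
labels = ["no", "no", "yes", "yes", "yes"]

class_counts, feature_counts = train_naive_bayes(rows, labels)
prediction = classify(("rainy", "mild"), class_counts, feature_counts)
```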
 Pros:
 This algorithm works very fast.
 It can also be used to solve multiclass prediction problems.
 This classifier performs better than other models with less training data if the assumption of independence of features holds.
 Cons:
 It assumes that all the features are independent. This is actually a big con because features in reality are frequently not fully independent.
 Use case:
 When the assumption of independence between features holds.
 If it holds, a Naive Bayes classifier will perform better than logistic regression and will require less training data.
 It performs better with categorical input variables than with numerical ones.
Regression Algorithms
Linear Regression
 Linear regression analysis is used to predict the value of a variable based on the value of another variable.
 The variable you want to predict is called the dependent variable.
 The variable you are using to predict the other variable’s value is called the independent variable. Ref: IBM
 It assumes a linear relationship between the two variables and fits a linear equation to the data.
 The goal of Linear Regression is to predict output values for inputs that are not present in the data set, with the belief that those outputs would fall on the line.
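 A minimal sketch of fitting such a line for a single independent variable, using the ordinary least squares closed form. The data points and function name are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)

# Predict the output for an input that is not in the data set.
prediction = slope * 5.0 + intercept
```

 Because the toy data is perfectly linear, the prediction for the unseen input falls on the fitted line; with noisy real data it would only be an estimate.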
 Pros:
 Performs very well when the relationship between the variables is linear.
 Easy to implement and is interpretable.
 Cons:
 Sensitive to noise and prone to overfitting.
 Very sensitive to outliers.
 Use case:
 Linear regression is commonly used for predictive analysis and modeling
Either Classification or Regression
K-Nearest Neighbor
 “Birds of a feather flock together”
 KNN is a supervised machine learning algorithm that assumes similar things exist in close proximity.
 “K” is a number of your choosing: it is the number of nearest neighbors the algorithm consults when producing an output.
 KNN answers the question: given the current data, what are the K data points most similar to the query?
 KNN most commonly calculates distance using either the Euclidean or the Manhattan distance:
 Euclidean:
 \[d(x, y)=\sqrt{\sum_{i=1}^{n}\left(y_{i}-x_{i}\right)^{2}}\]
 Manhattan:
 \[d(x, y)=\sum_{i=1}^{n}\left|x_{i}-y_{i}\right|\]
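 The two distance measures can be sketched in Python as follows; the function names and sample points are illustrative choices.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((yi - xi) ** 2 for xi, yi in zip(x, y)))

def manhattan_distance(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

straight_line = euclidean_distance(point_a, point_b)  # length of the direct path
city_block = manhattan_distance(point_a, point_b)     # length along the axes
```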
 “The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point.
 While it can be used for either regression or classification problems, it is typically used as a classification algorithm, working off the assumption that similar points can be found near one another.” Ref: IBM
 This is the high-level view of how the algorithm works:
 For each example in the data:
 Calculate the distance between the query example and the current example from the data
 Add the distance and index to an ordered collection
 Sort in ascending order by distance
 Pick first K from sorted order
 Get labels of selected K entries
 If regression, return the mean of the K labels
 If classification, return the mode of the K labels
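 The steps above can be sketched as a short brute-force implementation. It assumes Euclidean distance and small in-memory data; the function name, the `task` parameter, and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(data, labels, query, k, task="classification"):
    """Brute-force KNN: compare the query against every stored example."""
    distances = []
    for index, example in enumerate(data):
        # Euclidean distance between the query and the current example.
        distance = math.sqrt(sum((q - e) ** 2 for q, e in zip(query, example)))
        distances.append((distance, index))
    distances.sort()                      # ascending by distance
    nearest = distances[:k]               # first K entries
    k_labels = [labels[index] for _, index in nearest]
    if task == "regression":
        return sum(k_labels) / len(k_labels)         # mean of the K labels
    return Counter(k_labels).most_common(1)[0][0]    # mode of the K labels

# Made-up 2-D points forming two well-separated groups.
data = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
class_labels = ["a", "a", "a", "b", "b", "b"]
numeric_labels = [1.0, 1.2, 1.1, 6.0, 6.2, 6.1]

predicted_class = knn_predict(data, class_labels, (2, 2), k=3)
predicted_value = knn_predict(data, numeric_labels, (6.5, 6.5), k=3, task="regression")
```

 Note that there is no training step: all the work happens at query time, which is why memory use and prediction cost grow with the size of the data.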
 Pros:
 Easy to implement
 Needs only a few hyperparameters which are:
 The value of K
 Distance metric used
 Cons:
 Does not scale well as it takes too much memory and data storage compared with other classifiers
 Prone to overfitting if the value of K is too low and will underfit if the value of K is too high.
 Use case:
 When labelled data is too expensive or impossible to obtain.
 When the dataset is relatively small and noise-free.
Support Vector Machine
 The objective of SVM is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.
 The hyperplane is the decision boundary that helps classify the data points.
 If the number of input features is 2, the hyperplane is just a line; if the number of input features is 3, it becomes a 2D plane.
 When the data is not linearly separable in the original space, we can use the kernel trick. The SVM kernel is a function that takes a low-dimensional input space and implicitly transforms it into a higher-dimensional space, i.e., it converts a non-separable problem into a separable one. It is mostly useful in non-linear separation problems.
 A linear hyperplane separates the two classes such that its distance to the nearest data point on each side is maximized. This hyperplane is known as the maximum-margin hyperplane (hard margin).
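 To make the kernel idea concrete, here is a small sketch on made-up 1-D data. It assumes a degree-2 polynomial kernel and writes out the matching feature map explicitly, only to show that the kernel value equals a dot product in the higher-dimensional space; it does not train an SVM.

```python
import math

def feature_map(x):
    """Explicit map of a 1-D input into 3-D space: (x^2, sqrt(2)*x, 1)."""
    return (x * x, math.sqrt(2) * x, 1.0)

def polynomial_kernel(x, y):
    """Degree-2 polynomial kernel, computed without building the mapped vectors."""
    return (x * y + 1.0) ** 2

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Class -1 sits between the class +1 points, so no single threshold on x separates them.
points = [-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

# In the mapped space, a threshold on the first coordinate (x^2) separates the classes.
separable_after_mapping = all(
    (feature_map(x)[0] > 2.0) == (label == 1) for x, label in zip(points, labels)
)

# The kernel gives the same number as the dot product of the mapped points.
kernel_matches_mapping = all(
    math.isclose(polynomial_kernel(a, b), dot(feature_map(a), feature_map(b)), abs_tol=1e-9)
    for a in points
    for b in points
)
```

 The second check is the point of the trick: the algorithm only needs kernel values between pairs of points, so the higher-dimensional vectors never have to be constructed.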
 Pros:
 Works well when there is a clear margin of separation between classes.
 It’s easy to customize the kernel function depending on your dimensionality.
 It’s memory efficient, as the decision function uses only a subset of the training points, called support vectors.
 Cons:
 It doesn’t perform well on large data sets because the required training time is high.
 It also doesn’t perform very well when the data set is noisy, i.e., when the target classes overlap.
 Use case:
 While SVMs can be used for regression problems, they are best used for classification.
 Works great in high-dimensional spaces, i.e., when the number of features is high.
Decision Tree
 Like the name suggests, a Decision Tree is a tree with a flowchart-like structure consisting of 3 elements.
 The internal node denotes a test on an attribute.
 Each branch represents an outcome of the test.
 Each leaf node (terminal node) holds a class label.
 The objective of a Decision Tree is to create a model that can predict the class of the target variable by learning simple decision rules inferred from prior data (training data).
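 The three elements can be illustrated with a tiny hand-built tree. The tree below is written by hand rather than learned from data, and its attributes and labels are made up; it only shows how a prediction walks from internal nodes, along branches, to a leaf.

```python
def predict(node, example):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while "label" not in node:                 # internal node: a test on an attribute
        outcome = example[node["attribute"]]
        node = node["branches"][outcome]       # branch: one outcome of the test
    return node["label"]                       # leaf node: holds the class label

# Hypothetical tree: first test "outlook", then (if sunny) test "windy".
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {
                True: {"label": "stay in"},
                False: {"label": "play"},
            },
        },
        "rainy": {"label": "stay in"},
    },
}

decision = predict(tree, {"outlook": "sunny", "windy": False})
```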
 Pros:
 Interpretability is high due to the intuitive nature of a tree.
 Cons:
 Small changes in data can lead to large structural changes on the tree.
 Use case:
 When you want to be able to lay out all the possible outcomes of a problem and work on challenging each option.
XGBoost
 XGBoost is a gradient boosting algorithm that is highly efficient and scalable.
 Here’s a high-level overview of the XGBoost algorithm:
 Initialize the model with a weak learner, usually a decision tree stump (a decision tree with a single split)
 Compute the negative gradient of the loss function with respect to the current prediction
 Fit a decision tree to the negative gradient to make a new prediction
 Add the prediction from this tree to the current prediction
 Repeat steps 2-4 for a specified number of trees, or until a stopping criterion is met
 Combine the predictions from all trees to get the final prediction
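 As a worked example of the negative-gradient step, assume a squared-error loss (an assumption made here for illustration; other loss functions can be used). Writing \(F(x)\) for the current prediction, a symbol introduced only for this example:

 \[L(y, F(x))=\frac{1}{2}\left(y-F(x)\right)^{2}\]
 \[-\frac{\partial L}{\partial F(x)}=y-F(x)\]

 Under this loss the negative gradient is simply the residual, so fitting a tree to the negative gradient amounts to fitting a tree to the errors of the current prediction.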
Gradient Boosting
 Gradient Boosting is an ensemble machine learning algorithm that combines multiple weak models to create a strong model.
 It is an iterative process in which, at each iteration, a new model is fit to the residual errors made by the previous model, with the goal of decreasing the overall prediction error.
 The algorithm works as follows:
 Initialize the model with a weak learner, typically a decision tree with a single split.
 Compute the negative gradient of the loss function with respect to the current prediction.
 Fit a new model to the negative gradient.
 Update the prediction by adding the prediction from the new model.
 Repeat steps 2-4 for a specified number of iterations, or until a stopping criterion is met.
 Combine the predictions from all models to get the final prediction.
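 The six steps can be sketched in plain Python for a single feature. This is a toy version under stated assumptions: squared-error loss (so the negative gradient is the residual), decision stumps as the weak learner, and a `learning_rate` parameter that is an addition not mentioned in the steps above (left at 1.0 so the update matches them). The data and function names are made up.

```python
def fit_stump(xs, targets):
    """Decision tree with a single split: pick the threshold with least squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the largest x so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=1.0):
    first = fit_stump(xs, ys)                                  # step 1: initial weak learner
    later = []
    predictions = [first(x) for x in xs]
    for _ in range(n_rounds):                                  # step 5: repeat steps 2-4
        residuals = [y - p for y, p in zip(ys, predictions)]   # step 2: negative gradient
        stump = fit_stump(xs, residuals)                       # step 3: fit to the gradient
        later.append(stump)
        predictions = [p + learning_rate * stump(x)            # step 4: update prediction
                       for p, x in zip(predictions, xs)]

    def predict(x):                                            # step 6: combine all models
        return first(x) + learning_rate * sum(m(x) for m in later)
    return predict

def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data with roughly three levels, which one stump cannot capture.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 3.1, 3.0, 5.2, 5.1]

single_stump_error = mean_squared_error(fit_stump(xs, ys), xs, ys)
boosted_error = mean_squared_error(gradient_boost(xs, ys), xs, ys)
```

 On this toy data the boosted ensemble reaches a lower training error than the single stump, which is the effect the iterative residual fitting is meant to produce. Training error alone says nothing about overfitting, so a held-out set would be needed in practice.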