Logistic Regression Explained

by Gilbert Tanner on Sep 14, 2020 · 10 min read


Logistic Regression is a classical statistical model, which has been widely used in academia and industry to solve binary classification problems.

In this article, I will walk you through the basics of classification problems, how Logistic Regression works, and how you can use it in Python.

Difference between Regression and Classification

Supervised Machine Learning can be split into two subcategories – Regression and Classification.

The difference between the two is that in Regression we predict a continuous number, like the price of a house or the temperature for the next day, whilst in Classification we predict discrete values, like whether a patient has heart disease or not.

Figure 1: Classification vs Regression

Logistic Regression Theory

Logistic Regression is a statistical method that was designed to solve binary classification problems. It achieves this by passing the input through a linear function and then transforming the output to a probability value with the help of a sigmoid function.

Mathematically this looks like:

$$h_{\theta}(x)=P(Y=1|x;\theta)=sigmoid(Z)$$

$$Z=\theta^T x$$

Sigmoid Function

Figure 2: Sigmoid Function (Source)

Without the Sigmoid function, Logistic Regression would just be Linear Regression. That means that the output of the model could range from -∞ to ∞.

That's fine when working on a regression task, but for binary classification the output needs to be a probability value. This is where the sigmoid function comes in. It squeezes the output of the linear function Z into the range between 0 and 1. All input values greater than 0 produce an output greater than 0.5, and all inputs less than 0 produce an output less than 0.5.

Mathematically the sigmoid function looks like:

$$sigmoid(Z)=\frac{1}{1+e^{-Z}}$$
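
As a quick illustration of this squeezing behaviour, here is a minimal NumPy sketch (not part of the implementation later in the article):

import numpy as np

def sigmoid(Z):
    # Maps any real number into the range (0, 1)
    return 1 / (1 + np.exp(-Z))

print(sigmoid(np.array([-10, -1, 0, 1, 10])))
# ≈ [0.0000454, 0.269, 0.5, 0.731, 0.99995]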

Decision Boundary

To get a discrete class value (either 0 or 1), a decision boundary must be chosen. The decision boundary specifies how high the predicted probability must be for the model to output class 1.

Generally, the decision boundary is 0.5 so that if the output is >=0.5 we get class 1, else class 0.

$$h_{\theta}(x)\geq0.5\rightarrow y=1$$

$$h_{\theta}(x)<0.5\rightarrow y=0$$
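
In code, applying the decision boundary is just a comparison against the threshold. The tiny sketch below reuses the sigmoid helper from above; the input values are made up for illustration:

probabilities = sigmoid(np.array([-2.0, -0.1, 0.0, 0.3, 4.0]))
classes = (probabilities >= 0.5).astype(int)
print(classes)  # [0 0 1 1 1]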

Loss Function

For Logistic Regression we can't use the same loss function as for Linear Regression, because plugging the Logistic Function (Sigmoid Function) into the squared error makes the loss non-convex, which leads to many local optima.

Figure 3: Convex vs Non-Convex (Source)

Instead, we will use the following loss function for logistic regression:

$$\begin{align*}& J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \\ & \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \; & \text{if y = 1} \\ & \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \; & \text{if y = 0}\end{align*}$$

At first glance, the functions look complex but when visualized they are quite easy to grasp.

Figure 4: Cost Function for y=0 and y=1

The above graph shows that the further the prediction is from the actual y value, the bigger the loss gets.

That means that if the correct answer is 0, then the cost function will be 0 if the prediction is also 0. If the prediction approaches 1, then the cost function will approach infinity.

If the correct answer is 1, then the cost function will be 0 if the prediction is 1. If the prediction approaches 0, then the cost function will approach infinity.

Simplifying the Loss Function

To make it easier to work with the loss function we can compress the two conditional cases into one equation:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \; \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))\right]$$

Notice that when y is equal to 1 the second term will be zero and therefore will not affect the loss.

On the other hand, if y is equal to 0, the first term will be zero and therefore will not affect the loss.

Vectorized this looks like:

$$J(\theta)  = \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right)$$
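
The vectorized form translates almost directly into NumPy. Here is a minimal sketch; h and y are small made-up example arrays, not values from the implementation later in the article:

import numpy as np

y = np.array([1, 0, 1, 1])          # true labels
h = np.array([0.9, 0.2, 0.7, 0.6])  # predicted probabilities h_theta(x)

m = y.shape[0]
J = (1 / m) * (-y @ np.log(h) - (1 - y) @ np.log(1 - h))
print(J)  # ≈ 0.30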

Gradient Descent

To find the coefficients (weights) that minimize the loss function we will use Gradient Descent. There are more sophisticated optimization algorithms out there such as Adam but we won't worry about those in this article.

Remember the general form of gradient descent looks like:

$$\begin{align*}& Repeat \; \lbrace \\ & \; \theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j}J(\theta) \\ & \rbrace\end{align*}$$

We can get the gradient descent formula for Logistic Regression by taking the derivative of the loss function. This is quite involved, so I will show you the result first; you can skip the derivation if you like.

Result:

$$\begin{align*}& Repeat \; \lbrace \\& \; \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \\ & \rbrace\end{align*}$$

Notice that the update rule is identical to the one for Linear Regression; only the definition of the hypothesis $h_\theta(x)$ differs.
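
Written out in code, one full gradient descent update looks something like the following sketch (X, y, theta and alpha are hypothetical example values; the complete implementation follows later in the article):

import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def gradient_descent_step(theta, X, y, alpha):
    # One update of theta using the gradient derived above
    m = y.shape[0]
    h = sigmoid(X @ theta)               # predictions for all m examples
    gradient = (1 / m) * X.T @ (h - y)   # shape: (num_features,)
    return theta - alpha * gradient

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])  # first column acts as the bias term
y = np.array([1, 0, 1])
theta = np.zeros(2)
for _ in range(1000):
    theta = gradient_descent_step(theta, X, y, alpha=0.1)
print(sigmoid(X @ theta))  # probabilities move towards [1, 0, 1]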

Deriving the Gradient Descent formula for Logistic Regression (Optional)

First we need to calculate the derivative of the sigmoid function, which is quite easy to do using the quotient rule.

$$\begin{align*}\sigma(x)'&=\left(\frac{1}{1+e^{-x}}\right)'=\frac{-(1+e^{-x})'}{(1+e^{-x})^2}=\frac{-1'-(e^{-x})'}{(1+e^{-x})^2}=\frac{0-(-x)'(e^{-x})}{(1+e^{-x})^2}=\frac{-(-1)(e^{-x})}{(1+e^{-x})^2}=\frac{e^{-x}}{(1+e^{-x})^2} \\ &=\left(\frac{1}{1+e^{-x}}\right)\left(\frac{e^{-x}}{1+e^{-x}}\right)=\sigma(x)\left(\frac{+1-1 + e^{-x}}{1+e^{-x}}\right)=\sigma(x)\left(\frac{1 + e^{-x}}{1+e^{-x}} - \frac{1}{1+e^{-x}}\right)=\sigma(x)(1 - \sigma(x))\end{align*}$$

Now we are ready to find out the partial derivative:

$$\begin{align*}\frac{\partial}{\partial \theta_j} J(\theta) &= \frac{\partial}{\partial \theta_j} \frac{-1}{m}\sum_{i=1}^m \left [ y^{(i)} log (h_\theta(x^{(i)})) + (1-y^{(i)}) log (1 - h_\theta(x^{(i)})) \right] \\&= - \frac{1}{m}\sum_{i=1}^m \left [     y^{(i)} \frac{\partial}{\partial \theta_j} log (h_\theta(x^{(i)}))   + (1-y^{(i)}) \frac{\partial}{\partial \theta_j} log (1 - h_\theta(x^{(i)}))\right] \\&= - \frac{1}{m}\sum_{i=1}^m \left [     \frac{y^{(i)} \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})}{h_\theta(x^{(i)})}   + \frac{(1-y^{(i)})\frac{\partial}{\partial \theta_j} (1 - h_\theta(x^{(i)}))}{1 - h_\theta(x^{(i)})}\right] \\&= - \frac{1}{m}\sum_{i=1}^m \left [     \frac{y^{(i)} \frac{\partial}{\partial \theta_j} \sigma(\theta^T x^{(i)})}{h_\theta(x^{(i)})}   + \frac{(1-y^{(i)})\frac{\partial}{\partial \theta_j} (1 - \sigma(\theta^T x^{(i)}))}{1 - h_\theta(x^{(i)})}\right] \\&= - \frac{1}{m}\sum_{i=1}^m \left [     \frac{y^{(i)} \sigma(\theta^T x^{(i)}) (1 - \sigma(\theta^T x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{h_\theta(x^{(i)})}   + \frac{- (1-y^{(i)}) \sigma(\theta^T x^{(i)}) (1 - \sigma(\theta^T x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{1 - h_\theta(x^{(i)})}\right] \\&= - \frac{1}{m}\sum_{i=1}^m \left [     \frac{y^{(i)} h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{h_\theta(x^{(i)})}   - \frac{(1-y^{(i)}) h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{1 - h_\theta(x^{(i)})}\right] \\&= - \frac{1}{m}\sum_{i=1}^m \left [     y^{(i)} (1 - h_\theta(x^{(i)})) x^{(i)}_j - (1-y^{(i)}) h_\theta(x^{(i)}) x^{(i)}_j\right] \\&= - \frac{1}{m}\sum_{i=1}^m \left [     y^{(i)} (1 - h_\theta(x^{(i)})) - (1-y^{(i)}) h_\theta(x^{(i)}) \right] x^{(i)}_j \\&= - \frac{1}{m}\sum_{i=1}^m \left [     y^{(i)} - y^{(i)} h_\theta(x^{(i)}) - h_\theta(x^{(i)}) + y^{(i)} h_\theta(x^{(i)}) \right] x^{(i)}_j \\&= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} - h_\theta(x^{(i)}) \right] x^{(i)}_j  \\&= \frac{1}{m}\sum_{i=1}^m \left [ h_\theta(x^{(i)}) - y^{(i)} \right] x^{(i)}_j\end{align*}$$
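
If you want to verify the result numerically, you can compare the analytic gradient against a finite-difference approximation of the loss (a quick sketch with made-up X, y and theta):

import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(theta, X, y):
    return (1 / y.shape[0]) * X.T @ (sigmoid(X @ theta) - y)

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([0.3, -0.7])
eps = 1e-6

# Central finite-difference approximation of each partial derivative
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(theta.shape[0])
])
print(np.allclose(numeric, analytic_gradient(theta, X, y)))  # True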

Multiclass Classification: One-vs-all

Even though Logistic Regression was created to solve binary classification problems, it can also be used for more than two classes.

In this case, the problem is divided into n+1 binary subproblems, one for each class (with the classes labeled 0 through n).

Figure 5: One vs All splits the problem into subproblems.

In each subproblem we predict the probability that y is a member of one of our classes.

$$\begin{align*}& y \in \lbrace0, 1 ... n\rbrace \\& h_\theta^{(0)}(x) = P(y = 0 | x ; \theta) \\& h_\theta^{(1)}(x) = P(y = 1 | x ; \theta) \\& \cdots \\& h_\theta^{(n)}(x) = P(y = n | x ; \theta) \\& \mathrm{prediction} = \max_i( h_\theta ^{(i)}(x) )\\\end{align*}$$

We are basically choosing one class and lumping all the other classes into a single one. We do this repeatedly until we have gone through every class. Then we take the model with the highest output and use its class as our prediction.

Regularized Logistic Regression

As you might know, regularization is a set of techniques designed to combat overfitting. Overfitting, also called high variance, is the state in which the model fits the available data but doesn't generalize well to unseen data. It is usually caused by an overcomplicated prediction function that creates lots of unnecessary curves and angles unrelated to the data.

Figure 6: Overfitting vs Underfitting

There are two main options to address overfitting:

  • Reducing the number of features
  • Regularization

Manually reducing the number of features can be a tedious task because it often includes a lot of trial and error. Regularization, on the other hand, can happen automatically and has proven to be very reliable for lots of models over the years.

Logistic Regression can be regularized with the same techniques I explained when taking a look at Linear Regression – L1 and L2 Regularization.

Both techniques shrink the weights of the model by adding a penalty on their size to the loss. L1 Regularization takes the absolute values of all the weights and adds their sum to the loss. L2 Regularization sums up the squares instead of the absolute values.

L1 Regularization for Logistic Regression:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \; \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^n\left|\theta_j\right|$$

L2 Regularization for Logistic Regression:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \; \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$$

As you can see above, you can not only switch between L1 and L2 Regularization but also increase or decrease the effect of regularization using $\lambda$.
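
To make the difference concrete, here is a small NumPy sketch of both regularized cost functions (h, y, theta and lam are made-up example values; the bias term theta[0] is left out of the penalty, matching the sums starting at j=1 above):

import numpy as np

y = np.array([1, 0, 1])
h = np.array([0.8, 0.3, 0.9])        # predicted probabilities
theta = np.array([0.5, -1.2, 2.0])   # theta[0] is the bias and is not penalized
lam = 0.1
m = y.shape[0]

cross_entropy = (-1 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
l1_cost = cross_entropy + (lam / (2 * m)) * np.sum(np.abs(theta[1:]))
l2_cost = cross_entropy + (lam / (2 * m)) * np.sum(theta[1:] ** 2)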

Implementing Logistic Regression in Python

Now that you are familiar with the theory behind Logistic Regression, implementing it in Python is quite straightforward.

For the matrix math we use NumPy. If you aren't familiar with it, I'd recommend getting familiar with it before starting the implementation (good starting point).

import numpy as np


class LogisticRegression:

    def __init__(self, learning_rate, num_features, penalty='l2', C=0.1):
        self.learning_rate = learning_rate
        self.penalty = penalty
        self.C = C
        self.b = 0  # bias term
        self.w = np.zeros((1, num_features))  # weight vector, one weight per feature
        assert penalty in ['l2', 'l1', None]

    def sigmoid(self, x):
        return 1/(1+np.exp(-x))

    def cost_function(self, y, y_pred):
        m = y.shape[0]
        y_T = y.T
        # Binary cross-entropy loss
        cross_entropy = (-1/m) * np.sum((y_T*np.log(y_pred)) + ((1-y_T) * np.log(1-y_pred)))
        # Add the regularization penalty; self.C plays the role of lambda in the
        # formulas above, and the bias term b is not penalized
        if self.penalty == 'l1':
            return cross_entropy + (self.C/(2*m)) * np.sum(np.absolute(self.w))
        elif self.penalty == 'l2':
            return cross_entropy + (self.C/(2*m)) * np.sum(np.square(self.w))
        else:
            return cross_entropy

    def fit(self, X, y, num_iterations):
        for i in range(num_iterations):
            pred = self.sigmoid(np.dot(self.w, X.T) + self.b)
            cost = self.cost_function(y, pred)

            # Calculate Gradients/Derivatives of the cross-entropy loss
            dw = (1 / X.shape[0]) * (np.dot(X.T, (pred - y.T).T))
            db = (1 / X.shape[0]) * (np.sum(pred - y.T))

            # Add the gradient of the regularization penalty (only w is penalized)
            if self.penalty == 'l1':
                dw += (self.C / (2 * X.shape[0])) * np.sign(self.w).T
            elif self.penalty == 'l2':
                dw += (self.C / X.shape[0]) * self.w.T

            self.w = self.w - (self.learning_rate * dw.T)
            self.b = self.b - (self.learning_rate * db)

            #if i % 100 == 0:
                #print('Error:', cost)
        return self

    def predict(self, X):
        predictions = self.sigmoid(np.dot(self.w, X.T) + self.b)[0]
        return [1 if pred >= 0.5 else 0 for pred in predictions]

    def predict_proba(self, X):
        predictions = self.sigmoid(np.dot(self.w, X.T) + self.b)[0]
        return predictions

The above code is all that is needed to implement Logistic Regression in Python. The implementation also includes L1 and L2 regularization, which can be selected through the penalty argument.
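
As a quick sanity check, the class can be trained on a small synthetic dataset. The following sketch assumes the class above is saved as logistic_regression.py and that scikit-learn is available to generate the data:

import numpy as np
from sklearn.datasets import make_classification
from logistic_regression import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
y = y.reshape(-1, 1)  # the implementation expects y as a column vector

model = LogisticRegression(learning_rate=0.1, num_features=4)
model.fit(X, y, num_iterations=1000)

predictions = model.predict(X)
print('Training accuracy:', np.mean(np.array(predictions) == y.flatten()))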

Implementing Multiclass Logistic Regression

Now that we have Logistic Regression up and running, it's really simple to extend it for multiclass classification.

If a dataset has more than two classes, we just need to create multiple binary Logistic Regression models and train each one on a single class of the data. This can be achieved with the following code:

import numpy as np
from logistic_regression import LogisticRegression


class LogisticRegressionOneVsAll:

    def __init__(self, learning_rate, num_features, num_classes):
        # One binary Logistic Regression model per class (one-vs-all)
        self.models = [LogisticRegression(learning_rate, num_features) for _ in range(num_classes)]

    def fit(self, X, y, num_iterations):
        for i, model in enumerate(self.models):
            # Relabel the data: class i becomes 1, all other classes become 0
            y_tmp = (y == i).astype(int)
            model.fit(X, y_tmp, num_iterations)

    def predict(self, X):
        # Pick the class whose model outputs the highest probability for each sample
        predictions = np.array([model.predict_proba(X) for model in self.models])
        return np.argmax(predictions, axis=0)
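
Usage mirrors the binary case. The sketch below again uses scikit-learn only to create a small three-class dataset and assumes the one-vs-all class defined above is importable or defined in the same file:

import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
y = y.reshape(-1, 1)  # column vector, as expected by the binary implementation

model = LogisticRegressionOneVsAll(learning_rate=0.1, num_features=6, num_classes=3)
model.fit(X, y, num_iterations=1000)

predictions = model.predict(X)
print('Training accuracy:', np.mean(predictions == y.flatten()))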

Recap

Logistic Regression has been and still is widely used by many practitioners. This is because of its simplicity, which makes it easy to implement and fast to train. It is mostly used for binary classification, but it can also be used for more than two classes.

Where it struggles is on difficult classification problems with lots of features, as well as on problems that require feature interactions, which have to be added manually.

As a rule of thumb, it's always worth trying Logistic Regression, because it can serve as a really good baseline to build from.

Conclusion

That's it for this article. All the code from the article can be found on my GitHub. If you have any questions or feedback, feel free to contact me on Twitter or through my contact form. If you want to get continuous updates about my blog, make sure to join my newsletter.