Introduction to Ensemble Learning

Nobody can know everything, but with help we can overcome almost any obstacle. That is exactly the idea behind ensemble learning: even though individual models might produce weak results, combined they can be very hard to beat.

Ensemble models are exactly that: models built from a combination of base models. The only difference between the techniques is how they combine those base models, which ranges from simple methods like averaging or max voting to more complex ones like boosting or stacking.

Ensemble learning techniques have seen a huge jump in popularity in the last few years. They let you build a really robust model from a few “weak” models, which eliminates a lot of the model tuning that would otherwise be needed to achieve good results. They are especially popular in data science competitions, where the highest accuracy matters more than the runtime or interpretability of the model. That most often isn’t the case in industry, but it doesn’t mean ensembling can’t be useful for such applications; it just isn’t used at the same scale for industry problems as it is for data science competitions.

In this article, we will go through the most common ensembling techniques out there. You will learn how they work and how you can use them in Python. We will work with the House Price Regression data-set, which can be freely downloaded from Kaggle.

Also, the full code for both regression and classification is available on my Github. If you have any questions or recommendations feel free to leave a comment down below or contact me on social media.

Importing libraries and loading the data-set

After downloading the data-set, we can import everything we need and load the data with the following code.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pathlib
from scipy import stats
from scipy.stats import norm, skew

from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone # needed for the custom ensemble classes below
from sklearn.tree import DecisionTreeRegressor # used for the bagging and boosting examples

import xgboost as xgb
import lightgbm as lgb

import warnings
warnings.filterwarnings("ignore")

path = pathlib.Path('<path of data-set>')

# Load in the train and test dataset
train = pd.read_csv(path/'train.csv')
test = pd.read_csv(path/'test.csv')

train.head()
Figure 1: House Price Regression data-set

Feature Engineering

This data-set has quite a few interesting features, so we can get big accuracy gains by creating the right ones, but since feature engineering isn’t the topic of this article we won’t go into further detail.

If you are interested in the feature engineering, I recommend checking out this excellent Kaggle kernel, which not only includes great feature engineering but also shows how to do stacking.

After executing all the feature engineering code from the kernel, we have a data-set with 221 columns, a mixture of categorical and continuous ones.

Figure 2: Data-set after feature engineering

Base Models

Before we can start working through the different ensembling techniques, we need to define some base models to ensemble. For this data-set, we will use Lasso regression, Random Forest, Gradient Boosting, XGBoost, and LightGBM models.

We will validate the results using the root mean squared error and cross-validation.

from sklearn.model_selection import KFold, cross_val_score
n_folds = 5 # number of folds
def get_cv_scores(model, X, y, print_scores=True):
    kf = KFold(n_folds, shuffle=True, random_state=42) # create folds (pass the KFold object itself to cross_val_score)
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=kf)) # rmse for each fold
    if print_scores:
        print(f'Root mean squared error: {rmse.mean():.3f} ({rmse.std():.3f})')
    return rmse

Now we will quickly define and train each model so we can get an idea about their base performance.

# create the base models (X and y_train are produced by the feature engineering step above)
lasso_model = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
rf = RandomForestRegressor()
gbr = GradientBoostingRegressor()
xgb_model = xgb.XGBRegressor()
lgb_model = lgb.LGBMRegressor()
for model in [lasso_model, rf, gbr, xgb_model, lgb_model]:
    get_cv_scores(model, X, y_train)

This will output the root mean squared error for each of our models:

Lasso Regression: 0.124 (0.016)
RandomForestRegressor: 0.152 (0.009)
GradientBoostingRegressor: 0.126 (0.009)
XGBRegressor: 0.128 (0.008)
LGBMRegressor: 0.133 (0.009)

Ensembling Approaches

As mentioned at the start of the article, there are multiple ways to ensemble models. There are simple ones like max voting or averaging as well as more complex ones like boosting, bagging, or stacking.

What they all have in common is that they benefit hugely from uncorrelated base models, i.e. models that make very different predictions. To explain why this is the case, let us work through a little example from the Kaggle Ensembling Guide.

Imagine we have a data-set with all 1s as the ground truth targets. We could have 3 highly correlated models which produce the following predictions:

1111111100 = 80% accuracy
1111111100 = 80% accuracy
1011111100 = 70% accuracy

When we take a simple majority vote — choosing the value that appears most — we see no improvement:

1111111100 = 80% accuracy

This is because these models all learned the same thing and therefore make the same mistakes. But if we instead use three lower-performing but highly uncorrelated models, we can see that accuracy increases significantly:

1111111100 = 80% accuracy
0111011101 = 70% accuracy
1000101111 = 60% accuracy

Applying majority vote:

1111111101 = 90% accuracy

This is a huge improvement from the 60–80% accuracy of the base models.
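
To make this concrete, here is a small sketch (not from the original article) that reproduces the majority vote with NumPy, assuming the predictions are stored as arrays of 0s and 1s:

# Hypothetical reproduction of the majority-vote example above
preds = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],  # 80% accuracy
    [0, 1, 1, 1, 0, 1, 1, 1, 0, 1],  # 70% accuracy
    [1, 0, 0, 0, 1, 0, 1, 1, 1, 1],  # 60% accuracy
])
truth = np.ones(10)

# A position is predicted as 1 if at least two of the three models say 1
majority = (preds.sum(axis=0) >= 2).astype(int)
print(majority)                    # [1 1 1 1 1 1 1 1 0 1]
print((majority == truth).mean())  # 0.9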

Now that we have an understanding of what ensembling is and have our base models ready we can start working through the different ensembling techniques.

Averaging

Averaging is a simple method that is generally used for regression problems. The predictions are simply averaged to get a more robust result. Even though this method is simple, it almost always gives better results than a single model and is therefore always worth trying.

For ease of use, we will create a class that inherits from Scikit-Learn’s BaseEstimator, RegressorMixin, and TransformerMixin classes. That way we only need to implement the fit and predict methods to get a working model that can be used like any other Scikit-Learn model.

class AveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)
        return self
    
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)

The model can now be used like any other Scikit-Learn model. The only difference is that when creating an object we need to pass the base models, which are then trained in the fit method and used in the predict method.

%%time
averaged_model1 = AveragedModels([gbr, lasso_model, xgb_model])
get_cv_scores(averaged_model1, X, y_train);

By averaging three of our base models — Lasso regression, GBM, and XGBoost — we get a significant accuracy increase.

Root mean squared error loss:
Lasso Regression: 0.124 (0.016)
GradientBoostingRegressor: 0.126 (0.009)
XGBRegressor: 0.128 (0.008)
Averaged Model: 0.119 (0.009)

Weighted Average

Averaging models is simple and effective, but it has one major flaw: most of the time one model has more predictive power than the others, so we want to give it more weight in the final prediction.

We can achieve this by passing a weight for each model and multiplying each model’s predictions by the corresponding weight. The only thing we need to watch out for is that the weights must add up to 1 so we don’t change the scale of the predictions.

class WeightedAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models, weights):
        self.models = models
        self.weights = weights
        assert sum(self.weights)==1
        
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)
        return self
    
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.sum(predictions*self.weights, axis=1)

To now create the model we not only need to pass the base models but also an array of weights.

weighted_average_model = WeightedAveragedModels([gbr, lasso_model, xgb_model], [0.3, 0.45, 0.25])
get_cv_scores(weighted_average_model, X, y_train);

This even further reduces the error to an average of 0.118 with a standard deviation of 0.009 over the different folds.

Bagging

Bagging is a hugely popular ensembling method used in algorithms like Random Forest. It gains accuracy not only by averaging the models but also by trying to create models that are as uncorrelated as possible by giving them different training sets.

It creates each training set by sampling with replacement (bootstrapping), a simple but effective sampling technique. To implement this, we will create a method called subsample and call it for every model to create its own data-set.

class BaggingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        for model in self.models_:
            X_tmp, y_tmp = self.subsample(X, y)
            model.fit(X_tmp, y_tmp)
        
        return self
            
    # Create a random subsample from the dataset with replacement (bootstrapping)
    def subsample(self, X, y, ratio=1.0):
        n_sample = round(len(X) * ratio)
        # draw n_sample row indices with replacement
        indices = np.random.randint(len(X), size=n_sample)
        return X[indices], y[indices]
    
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)
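
As a quick illustration (this usage is not part of the original article, and it assumes X is a NumPy array as in the rest of the code), we could bag a handful of decision trees and evaluate them with our cross-validation helper:

# Hypothetical usage of the BaggingModels class defined above: ten bagged decision trees
bagged_trees = BaggingModels([DecisionTreeRegressor() for _ in range(10)])
get_cv_scores(bagged_trees, X, y_train);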

With this technique, we are able to create models that are highly uncorrelated and therefore perform really well when ensembled, as we can see when we compare a single decision tree with a random forest.

dt = DecisionTreeRegressor()
rf = RandomForestRegressor()
for model in [dt, rf]:
    get_cv_scores(model, X, y_train);

Here we can clearly see the power of bagging:

DecisionTree: 0.213 (0.019)
RandomForest: 0.154 (0.008)

Boosting

Boosting is a sequential process where each subsequent model tries to correct the errors of the previous one. The succeeding models therefore depend on the previous models, and we need to train them in sequence rather than in parallel.

Boosting follows the following steps:

  1. Creates a subset from the original dataset (Initially, all datapoints are weighted equally)
  2. Creates and trains a base-model on the subset
  3. The base model is used to make predictions on the whole dataset
  4. Errors of the predictions are calculated
  5. Incorrectly predicted datapoints (datapoints with bigger error) are given higher weights
  6. Another model is created on the datapoints with high weights
  7. Steps are repeated as long as needed
  8. The final model is the weighted average of all models (better models get higher weights)

One of the first popular boosting implementations is called AdaBoost. To create an AdaBoost model we can simply use the AdaBoostRegressor model from Scikit-Learn.

from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
model = AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=3), n_estimators=50)
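Note that newer scikit-learn releases have renamed the base_estimator argument to estimator. As a quick check that isn’t in the original article, the model could be evaluated with the same cross-validation helper as the other base models:

# Hypothetical: evaluate the AdaBoost model with the CV helper defined earlier
get_cv_scores(model, X, y_train);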

Stacking

Stacking was introduced by Wolpert in his 1992 paper Stacked Generalization. The base models are trained with k-fold cross-validation and each of them makes predictions on the left-out fold. These so-called out-of-fold predictions are then used to train another model, the meta-model, which can use the information produced by the base models to make the final predictions.

To implement this functionality, we first train each base model k times (where k is the number of folds) and then use their out-of-fold predictions to train our meta-model.

To get even better results, we can train the meta-model not only on the predictions of all the base models but also on the initial features. Because adding the input features increases model complexity, we add a boolean parameter (use_features_in_secondary) that determines whether the input features should be used.

class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5, use_features_in_secondary=False):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
        self.use_features_in_secondary = use_features_in_secondary
        
    def fit(self, X, y):
        """Fit all the models on the given dataset"""
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=42)
        
        # Train cloned base models and create out-of-fold predictions
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
        
        if self.use_features_in_secondary:
            self.meta_model_.fit(np.hstack((X, out_of_fold_predictions)), y)
        else:
            self.meta_model_.fit(out_of_fold_predictions, y)
            
        return self
    
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        if self.use_features_in_secondary:
            return self.meta_model_.predict(np.hstack((X, meta_features)))
        else:
            return self.meta_model_.predict(meta_features)

Now that our StackingAveragedModels is ready we can create and train a stacked average model with the following code:

# use_features_in_secondary=False
stacking_model1 = StackingAveragedModels([gbr, lgb_model, xgb_model], lasso_model)
get_cv_scores(stacking_model1, X, y_train);
# use_features_in_secondary=True
stacking_model2= StackingAveragedModels([gbr, lgb_model, xgb_model], lasso_model, use_features_in_secondary=True)
get_cv_scores(stacking_model2, X, y_train);

By using a Gradient Boosting Machine, LightGBM, and XGBoost as the base models and Lasso regression as the meta-model, we achieve a loss of 0.124. By including the input features, we further reduce it to 0.120.

For this problem, that doesn’t seem like a big deal given that we achieved a better result with our simple weighted average model, but for larger data-sets both stacking and blending can give you really good results. Furthermore, you could extend this implementation to use multiple layers, which could reduce the error even further, as sketched below.
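
As a rough sketch (not from the original article), one simple way to get a second layer is to reuse the class above and pass an already stacked model as one of the base models; the names below are purely illustrative:

# Hypothetical two-layer stack: the first-layer stacker becomes a base model
# of a second StackingAveragedModels, with Lasso again as the final meta-model.
two_layer_stack = StackingAveragedModels([stacking_model1, gbr, lgb_model], lasso_model)
get_cv_scores(two_layer_stack, X, y_train);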

Blending

Blending is a term introduced by the Netflix competition winners. It is very similar to stacking, with the only difference being that instead of creating out-of-fold predictions with k-fold you create a small holdout data-set that is then used to train the meta-model.

This offers a few benefits:

  • Simpler than Stacking
  • Faster than Stacking
  • Prevents information leakage

Cons:

  • You use less data overall
  • Meta-model may overfit to the holdout data-set
  • A less accurate performance estimate than with stacking (because a single holdout set is used instead of k-fold)

from sklearn.model_selection import train_test_split
class BlendingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, holdout_pct=0.2, use_features_in_secondary=False):
        self.base_models = base_models
        self.meta_model = meta_model
        self.holdout_pct = holdout_pct
        self.use_features_in_secondary = use_features_in_secondary
        
    def fit(self, X, y):
        self.base_models_ = [clone(x) for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        
        X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=self.holdout_pct)
                
        holdout_predictions = np.zeros((X_holdout.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models_):
            model.fit(X_train, y_train)
            y_pred = model.predict(X_holdout)
            holdout_predictions[:, i] = y_pred
        if self.use_features_in_secondary:
            self.meta_model_.fit(np.hstack((X_holdout, holdout_predictions)), y_holdout)
        else:
            self.meta_model_.fit(holdout_predictions, y_holdout)
            
        return self
    
    def predict(self, X):
        meta_features = np.column_stack([
            model.predict(X) for model in self.base_models_
        ])
        if self.use_features_in_secondary:
            return self.meta_model_.predict(np.hstack((X, meta_features)))
        else:
            return self.meta_model_.predict(meta_features)

This model can be trained the same way as our StackingAveragedModels. We only need to pass it some base models and a meta-model.

blending_model1 = BlendingAveragedModels([gbr, lgb_model, xgb_model, lasso_model], lasso_model)
get_cv_scores(blending_model1, X, y_train);

This gives us a root mean squared error of 0.120, which is almost the same accuracy we got with stacking. The big difference between stacking and blending shows up in training time: while the stacking model took about 42 seconds to train on my notebook, the blending model only took about 8 seconds.

Conclusion

Ensemble learning is a hugely effective way to improve the accuracy of your machine learning models. It encompasses a lot of different methods, ranging from the easy-to-implement and simple-to-use averaging approach to more advanced techniques like stacking and blending.

In this article, we covered the basic workings of these different ensembling techniques and how to implement them in Python.

That’s all from this article. If you have any questions or just want to chat with me feel free to leave a comment below or contact me on social media. If you want to get continuous updates about my blog make sure to follow me on Medium and join my newsletter.