Strength in Numbers: Ensembling Models with Bagging and Boosting

    Bagging and boosting are two powerful ensemble techniques in machine learning – they are must-knows for data scientists! After reading this article, you are going to have a solid understanding of how bagging and boosting work and when to use them. We’ll cover the following topics, relying heavily on examples to give hands-on illustration of the key concepts:

    How Ensembling helps create powerful models

    Bagging: Adding stability to ML models

    Boosting: Reducing bias in weak learners

    Bagging vs. Boosting – when to use each and why

    Creating powerful models with ensembling

    In Machine Learning, ensembling is a broad term that refers to any technique that creates predictions by combining the predictions from multiple models. If there is more than one model involved in making a prediction, the technique is using ensembling!

    Ensembling approaches can often improve the performance of a single model. Ensembling can help reduce:

    Variance by averaging multiple models

    Bias by iteratively improving on errors

    Overfitting because using multiple models can increase robustness to spurious relationships
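
    To make the idea concrete, here is a minimal sketch (not part of the article’s later code) that ensembles two very different models – a decision tree and a linear regression – by averaging their predictions with scikit-learn’s VotingRegressor:

    # minimal ensembling sketch: average two different models' predictions
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import VotingRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    X, y = load_diabetes(return_X_y=True)

    # VotingRegressor averages the predictions of its member models
    ensemble = VotingRegressor([
        ('tree', DecisionTreeRegressor(random_state=42)),
        ('linear', LinearRegression()),
    ])

    # the ensemble often scores better than either member alone
    print(cross_val_score(ensemble, X, y, cv=5).mean())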

    Bagging and boosting are both ensemble methods that can perform much better than their single-model counterparts. Let’s get into the details of these now!

    Bagging: Adding stability to ML models

    Bagging is a specific ensembling technique that is used to reduce the variance of a predictive model. Here, I’m talking about variance in the machine learning sense – i.e., how much a model varies with changes to the training dataset – not variance in the statistical sense which measures the spread of a distribution. Because bagging helps reduce an ML model’s variance, it will often improve models that are high variance (e.g., decision trees and KNN) but won’t do much good for models that are low variance (e.g., linear regression).

    Now that we understand when bagging helps (high variance models), let’s get into the details of the inner workings to understand how it helps! The bagging algorithm is iterative in nature – it builds multiple models by repeating the following three steps:

    Bootstrap a dataset from the original training data

    Train a model on the bootstrapped dataset

    Save the trained model

    The collection of models created in this process is called an ensemble. When it is time to make a prediction, each model in the ensemble makes its own prediction – the final bagged prediction is the average (for regression) or majority vote (for classification) of all of the ensemble’s predictions.
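
    As a quick illustrative sketch (with made-up predictions, not the article’s code), combining an ensemble’s outputs looks like this:

    import numpy as np
    from collections import Counter

    # regression: the bagged prediction is the average of the trees' predictions
    tree_preds = [102.0, 115.0, 98.0]        # hypothetical predictions from 3 trees
    bagged_pred = np.mean(tree_preds)        # 105.0

    # classification: the bagged prediction is the majority vote
    tree_votes = ['cat', 'dog', 'cat']       # hypothetical class predictions
    majority_vote = Counter(tree_votes).most_common(1)[0][0]   # 'cat'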

    Now that we understand how bagging works, let’s take a few minutes to build an intuition for why it works. We’ll borrow a familiar idea from traditional statistics: sampling to estimate a population mean.

    In statistics, each sample drawn from a distribution is a random variable. Small sample sizes tend to have high variance and may provide poor estimates of the true mean. But as we collect more samples, the average of those samples becomes a much better approximation of the population mean.

    Similarly, we can think of each of our individual decision trees as a random variable — after all, each tree is trained on a different random sample of the data! By averaging predictions from many trees, bagging reduces variance and produces an ensemble model that better captures the true relationships in the data.
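
    We can see this effect with a quick simulation (a sketch for intuition only, not part of the article’s code): the average of many noisy estimates has a much smaller spread than any single estimate.

    import numpy as np

    rng = np.random.default_rng(42)
    population = rng.normal(loc=100, scale=25, size=100_000)

    # a single small sample gives a noisy estimate of the population mean
    single_estimates = [rng.choice(population, size=20).mean() for _ in range(500)]

    # averaging 50 such estimates is far more stable
    averaged_estimates = [
        np.mean([rng.choice(population, size=20).mean() for _ in range(50)])
        for _ in range(500)
    ]

    print(np.std(single_estimates))    # relatively large spread
    print(np.std(averaged_estimates))  # much smaller spread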

    Bagging Example

    We will be using the load_diabetes¹ dataset from the scikit-learn Python package to illustrate a simple bagging example. The dataset has 10 input variables – Age, Sex, BMI, Blood Pressure and 6 blood serum levels (S1-S6) – and a single output variable that is a measurement of disease progression. The code below pulls in our data and does some very simple cleaning. With our dataset established, let’s start modeling!

    # pull in and format data
    import pandas as pd
    from sklearn.datasets import load_diabetes

    diabetes = load_diabetes(as_frame=True)
    df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
    df.loc[:, 'target'] = diabetes.target
    df = df.dropna()

    For our example, we will use basic decision trees as our base models for bagging. Let’s first verify that our decision trees are indeed high variance. We will do this by training three decision trees on different bootstrapped datasets and observing the variance of the predictions for a test dataset. The graph below shows the predictions of three different decision trees on the same test dataset. Each dotted vertical line is an individual observation from the test dataset. The three dots on each line are the predictions from the three different decision trees.

    Variance of decision trees on test data points – image by author

    In the chart above, we see that individual trees can give very different predictions (the spread of the three dots on each vertical line) when trained on bootstrapped datasets. This is the variance we have been talking about!
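
    If you want to reproduce this experiment, a sketch along these lines works (the train/test split and column names are my assumptions based on the dataset loaded above):

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    pred_cols = [c for c in df.columns if c != 'target']
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

    # train three trees, each on its own bootstrapped copy of the training data
    for seed in range(3):
        boot_df = train_df.sample(frac=1.0, replace=True, random_state=seed)
        tree = DecisionTreeRegressor(random_state=42)
        tree.fit(boot_df[pred_cols], boot_df['target'])
        # predictions on the same few test rows differ noticeably between trees
        print(tree.predict(test_df[pred_cols].head(5)))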

    Now that we see that our trees aren’t very robust to training samples – let’s average the predictions to see how bagging can help! The chart below shows the average of the three trees. The diagonal line represents perfect predictions. As you can see, with bagging, our points are tighter and more centered around the diagonal.

    image by author

    We’ve already seen significant improvement in our model with the average of just three trees. Let’s beef up our bagging algorithm with more trees!

    Here is the code to bag as many trees as we want:

    def train_bagging_trees(df, target_col, pred_cols, n_trees):
        '''
        Creates a decision tree bagging model by training multiple
        decision trees on bootstrapped data.

        inputs
            df          : training data with both target and input columns
            target_col  : name of target column
            pred_cols   : list of predictor column names
            n_trees     : number of trees to be trained in the ensemble

        output:
            train_trees : list of trained trees
        '''
        train_trees = []
        for i in range(n_trees):
            # bootstrap training data
            temp_boot = bootstrap(df)
            # train tree (plain_vanilla_tree is defined in the boosting section below)
            temp_tree = plain_vanilla_tree(temp_boot, target_col, pred_cols)
            # save trained tree in list
            train_trees.append(temp_tree)
        return train_trees

    def bagging_trees_pred(df, train_trees, target_col, pred_cols):
        '''
        Takes a list of bagged trees and creates predictions by averaging
        the predictions of each individual tree.

        inputs
            df          : data with both target and input columns
            train_trees : ensemble model - which is a list of trained decision trees
            target_col  : name of target column
            pred_cols   : list of predictor column names

        output:
            avg_preds   : list of predictions from the ensembled trees
        '''
        x = df[pred_cols]
        y = df[target_col]
        preds = []
        # make predictions on data with each decision tree
        for tree in train_trees:
            temp_pred = tree.predict(x)
            preds.append(temp_pred)
        # get average of the trees' predictions
        sum_preds = [sum(x) for x in zip(*preds)]
        avg_preds = [x / len(train_trees) for x in sum_preds]
        return avg_preds

    The functions above are very simple: the first trains the bagging ensemble model, the second takes the ensemble (simply a list of trained trees) and makes predictions given a dataset.
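
    The bootstrap helper isn’t shown in the article; a minimal version (my assumption about how it works – sampling rows with replacement) could look like this:

    def bootstrap(df):
        '''Returns a bootstrapped copy of df: same number of rows, sampled with replacement.'''
        return df.sample(frac=1.0, replace=True)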

    With our code established, let’s run multiple ensemble models and see how our out-of-bag predictions change as we increase the number of trees.
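
    For reference, running the ensembles used in the chart below might look roughly like this (a sketch that reuses the train/test split and pred_cols assumed earlier):

    for n_trees in [1, 3, 50, 150]:
        ensemble = train_bagging_trees(train_df, 'target', pred_cols, n_trees)
        preds = bagging_trees_pred(test_df, ensemble, 'target', pred_cols)
        residuals = test_df['target'] - preds
        print(n_trees, residuals.min(), residuals.max())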

    Out-of-bag predictions vs. actuals colored by number of bagged trees – image by author

    Admittedly, this chart looks a little crazy. Don’t get too bogged down with all of the individual data points, the dashed lines tell the main story! Here we have 1 basic decision tree model and 3 bagged decision tree models – with 3, 50 and 150 trees. The color-coded dotted lines mark the upper and lower ranges for each model’s residuals. There are two main takeaways here: (1) as we add more trees, the range of the residuals shrinks and (2) there are diminishing returns to adding more trees – when we go from 1 to 3 trees, we see the range shrink a lot; when we go from 50 to 150 trees, the range tightens just a little.

    Now that we’ve successfully gone through a full bagging example, we are about ready to move onto boosting! Let’s do a quick overview of what we covered in this section:

    Bagging reduces variance of ML models by averaging the predictions of multiple individual models

    Bagging is most helpful with high-variance models

    The more models we bag, the lower the variance of the ensemble – but there are diminishing returns to the variance reduction benefit

    Okay, let’s move on to boosting!

    Boosting: Reducing bias in weak learners

    With bagging, we create multiple independent models – the independence of the models helps average out the noise of individual models. Boosting is also an ensembling technique; similar to bagging, we will be training multiple models… But, very differently from bagging, the models we train will be dependent. Boosting is a modeling technique that trains an initial model and then sequentially trains additional models to improve the predictions of prior models. The primary target of boosting is to reduce bias – though it can also help reduce variance.

    We’ve established that boosting iteratively improves predictions – let’s go deeper into how. Boosting algorithms can iteratively improve model predictions in two ways:

    Directly predicting the residuals of the last model and adding them to the prior predictions – think of it as residual corrections

    Adding more weight to the observations that the prior model predicted poorly

    Because boosting’s main goal is to reduce bias, it works well with base models that typically have more bias (e.g., shallow decision trees). For our examples, we are going to use shallow decision trees as our base model – we will only cover the residual prediction approach in this article for brevity. Let’s jump into the boosting example!

    Predicting prior residuals

    The residuals prediction approach starts off with an initial model (some algorithms provide a constant, others use one iteration of the base model) and we calculate the residuals of that initial prediction. The second model in the ensemble predicts the residuals of the first model. With our residual predictions in hand, we add the residual predictions to our initial prediction (this gives us residual-corrected predictions) and recalculate the updated residuals… we continue this process until we have created the number of base models we specified. This process is pretty simple, but it is a little hard to explain with just words – the flowchart below shows a simple, 4-model boosting algorithm.

    Flowchart of simple, 4 model boosting algorithm – image by author

    When boosting, we need to set three main parameters: (1) the number of trees, (2) the tree depth and (3) the learning rate. I’ll spend a little time discussing these inputs now.

    Number of Trees

    For boosting, the number of trees means the same thing as in bagging – i.e., the total number of trees that will be trained for the ensemble. But, unlike bagging, we should not err on the side of more trees! The chart below shows the test RMSE against the number of trees for the diabetes dataset.

    Unlike with bagging, too many trees in boosting leads to overfitting! – image by author

    This shows that the test RMSE drops quickly with the number of trees up until about 200 trees, then it starts to creep back up. It looks like a classic ‘overfitting’ chart – we reach a point where more trees become worse for the model. This is a key difference between bagging and boosting – with bagging, more trees eventually stop helping; with boosting, more trees eventually start hurting!

    With bagging, more trees eventually stop helping; with boosting, more trees eventually start hurting!

    We now know that too many trees are bad, and too few trees are bad as well. We will use hyperparameter tuning to select the number of trees. Note – hyperparameter tuning is a huge subject and way outside of the scope of this article. I’ll demonstrate a simple grid search with a train and test dataset for our example a little later.

    Tree Depth

    This is the maximum depth for each tree in the ensemble. With bagging, trees are often allowed to go as deep as they want because we are looking for low bias, high variance models. With boosting however, we use sequential models to address the bias in the base learners – so we aren’t as concerned about generating low-bias trees. How do we decide the maximum depth? The same technique that we’ll use with the number of trees: hyperparameter tuning.

    Learning Rate

    The number of trees and the tree depth are familiar parameters from bagging (although in bagging we often didn’t put a limit on the tree depth) – but this ‘learning rate’ character is a new face! Let’s take a moment to get familiar. The learning rate is a number between 0 and 1 that is multiplied by the current model’s residual predictions before they are added to the overall predictions.

    Here’s a simple example of the prediction calculations with a learning rate of 0.5. Once we understand the mechanics of how the learning rate works, we will discuss why the learning rate is important.

    The learning rate discounts the residual prediction before updating the actual target prediction – image by author
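
    In text form, the calculation in the figure works like this (the numbers here are made up purely for illustration):

    # one boosting update with a learning rate of 0.5 (illustrative numbers only)
    current_pred = 100.0          # ensemble's prediction so far
    actual = 120.0                # true target value
    residual = actual - current_pred            # 20.0
    residual_pred = 18.0          # what the new tree predicts for the residual

    learning_rate = 0.5
    updated_pred = current_pred + learning_rate * residual_pred   # 100 + 0.5*18 = 109.0
    updated_residual = actual - updated_pred                      # 11.0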

    So, why would we want to ‘discount’ our residual predictions, wouldn’t that make our predictions worse? Well, yes and no. For a single iteration, it will likely make our predictions worse – but we are doing multiple iterations. Across iterations, the learning rate keeps the model from overreacting to any single tree’s predictions. Ultimately, the learning rate helps mitigate overfitting in our boosting model by lowering the influence of any single tree in the ensemble. You can think of it as slowly turning the steering wheel to correct your driving rather than jerking it. In practice, the number of trees and the learning rate have an inverse relationship: as the learning rate goes down, the number of trees goes up. This is intuitive, because if we only allow a small amount of each tree’s residual prediction to be added to the overall prediction, we are going to need a lot more trees before our overall prediction starts looking good.

    Ultimately, the learning rate helps mitigate overfitting in our boosting model by lowering the influence of any single tree in the ensemble.

    Alright, now that we’ve covered the main inputs in boosting, let’s get into the Python coding! We need a couple of functions to create our boosting algorithm:

    Base decision tree function – a simple function to create and train a single decision tree. We will use the same function from the last section called ‘plain_vanilla_tree.’

    Boosting training function – this function sequentially trains and updates residuals for as many decision trees as the user specifies. In our code, this function is called ‘boost_resid_correction.’

    Boosting prediction function – this function takes a series of boosted models and makes final ensemble predictions. We call this function ‘boost_resid_correction_predict.’

    Here are the functions written in Python:

    # imports used by the boosting code
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_squared_error
    import numpy as np
    import matplotlib.pyplot as plt

    # same base tree function as in prior section
    def plain_vanilla_tree(df_train, target_col, pred_cols, max_depth=3, weights=[]):

        X_train = df_train[pred_cols]
        y_train = df_train[target_col]
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
        if weights:
            tree.fit(X_train, y_train, sample_weight=weights)
        else:
            tree.fit(X_train, y_train)
        return tree

    # residual predictions
    def boost_resid_correction(df_train, target_col, pred_cols, num_models,
                               learning_rate=1, max_depth=3):
        '''
        Creates boosted decision tree ensemble model.

        Inputs:
            df_train      : contains training data
            target_col    : name of target column
            pred_cols     : list of predictor column names
            num_models    : number of models to use in boosting
            learning_rate : discount given to residual predictions,
                            takes values between (0, 1]
            max_depth     : max depth of each tree model

        Outputs:
            boosting_model : contains everything needed to use model
                             to make predictions - includes list of all
                             trees in the ensemble
        '''
        # create initial predictions
        model1 = plain_vanilla_tree(df_train, target_col, pred_cols, max_depth=max_depth)
        initial_preds = model1.predict(df_train[pred_cols])
        df_train['resids'] = df_train[target_col] - initial_preds

        # create multiple models, each predicting the updated residuals
        models = []
        for i in range(num_models):
            # each residual tree uses the same max depth as the initial model
            temp_model = plain_vanilla_tree(df_train, 'resids', pred_cols, max_depth=max_depth)
            models.append(temp_model)
            temp_pred_resids = temp_model.predict(df_train[pred_cols])
            df_train['resids'] = df_train['resids'] - learning_rate*temp_pred_resids

        boosting_model = {'initial_model' : model1,
                          'models' : models,
                          'learning_rate' : learning_rate,
                          'pred_cols' : pred_cols}

        return boosting_model

    # This function takes the residual boosted model and scores data
    def boost_resid_correction_predict(df, boosting_models, chart=False):

        '''
        Creates predictions on a dataset given a boosted model.

        Inputs:
            df              : data to make predictions on
            boosting_models : dictionary containing all pertinent
                              boosted model data
            chart           : indicates if performance chart should
                              be created
        Outputs:
            pred : predictions from boosted model
            rmse : RMSE of predictions
        '''

        # get initial predictions
        initial_model = boosting_models['initial_model']
        pred_cols = boosting_models['pred_cols']
        pred = initial_model.predict(df[pred_cols])

        # calculate residual predictions from each model and add
        models = boosting_models['models']
        learning_rate = boosting_models['learning_rate']
        for model in models:
            temp_resid_preds = model.predict(df[pred_cols])
            pred += learning_rate*temp_resid_preds

        if chart:
            plt.scatter(df['target'], pred)
            plt.show()

        rmse = np.sqrt(mean_squared_error(df['target'], pred))

        return pred, rmse

    Sweet, let’s make a model on the same diabetes dataset that we used in the bagging section. We’ll do a quick grid search (again, not doing anything fancy with the tuning here) to tune our three parameters and then we’ll train the final model using the boost_resid_correction function.

    # tune parameters with grid search
    # assumes train_df, test_df and pred_cols were created earlier
    # when splitting the diabetes data into train and test sets
    n_trees = [5, 10, 30, 50, 100, 125, 150, 200, 250, 300]
    learning_rates = [0.001, 0.01, 0.1, 0.25, 0.50, 0.75, 0.95, 1]
    max_depths = list(range(1, 16))

    # Create a dictionary to hold test RMSE for each 'square' in grid
    perf_dict = {}
    for tree in n_trees:
        for learning_rate in learning_rates:
            for max_depth in max_depths:
                temp_boosted_model = boost_resid_correction(train_df,
                                                            'target',
                                                            pred_cols,
                                                            tree,
                                                            learning_rate=learning_rate,
                                                            max_depth=max_depth)
                temp_boosted_model['target_col'] = 'target'
                preds, rmse = boost_resid_correction_predict(test_df, temp_boosted_model)
                dict_key = '_'.join(str(x) for x in [tree, learning_rate, max_depth])
                perf_dict[dict_key] = rmse

    # find the parameter combination with the lowest test RMSE
    min_key = min(perf_dict, key=perf_dict.get)
    print(min_key, perf_dict[min_key])

    And our winner is — 50 trees, a learning rate of 0.1 and a max depth of 1! Let’s take a look and see how our predictions did.
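
    Training the final model with the winning parameters is then just one more call (a sketch, reusing the same assumed train_df, test_df and pred_cols):

    # train the final boosted model with the tuned parameters
    final_boosted_model = boost_resid_correction(train_df,
                                                 'target',
                                                 pred_cols,
                                                 50,
                                                 learning_rate=0.1,
                                                 max_depth=1)
    final_preds, final_rmse = boost_resid_correction_predict(test_df, final_boosted_model, chart=True)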

    Tuned boosting actuals vs. residuals – image by author

    While our boosting ensemble model seems to capture the trend reasonably well, we can see off the bat that it isn’t predicting as well as the bagging model. We could probably spend more time tuning – but it could also be the case that the bagging approach fits this specific data better. With that said, we’ve now earned an understanding of bagging and boosting – let’s compare them in the next section!

    Bagging vs. Boosting – understanding the differences

    We’ve covered bagging and boosting separately; the table below brings together all the information we’ve covered to concisely compare the two approaches:

    image by author

    Note: In this article, we wrote our own bagging and boosting code for educational purposes. In practice you will just use the excellent code that is available in Python packages or other software. Also, people rarely use ‘pure’ bagging or boosting – it is much more common to use more advanced algorithms that modify the plain vanilla bagging and boosting to improve performance.
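
    For example, scikit-learn ships bagging and boosting implementations that you would typically reach for instead of hand-rolled versions. A quick sketch of the rough equivalents of what we built above (again assuming the train_df and pred_cols from earlier):

    from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
    from sklearn.tree import DecisionTreeRegressor

    # bagged decision trees, similar in spirit to train_bagging_trees
    bagger = BaggingRegressor(DecisionTreeRegressor(), n_estimators=150, random_state=42)

    # gradient boosting with shallow trees, similar in spirit to boost_resid_correction
    booster = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1,
                                        max_depth=1, random_state=42)

    bagger.fit(train_df[pred_cols], train_df['target'])
    booster.fit(train_df[pred_cols], train_df['target'])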

    Wrapping it up

    Bagging and boosting are powerful and practical ways to improve weak learners like the humble but flexible decision tree. Both approaches use the power of ensembling to address different problems – bagging for variance, boosting for bias. In practice, pre-packaged code is almost always used to train more advanced machine learning models that use the main ideas of bagging and boosting but expand on them with multiple improvements.

    I hope that this has been helpful and interesting – happy modeling!

    ¹ Dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases and is distributed under the public domain license for use without restriction.
