Boosting and Bagging: How To Develop A Robust Machine Learning Algorithm
Machine learning and data science require more than just throwing data into a Python library and utilizing whatever comes out.
Data scientists need to actually understand the data, and the processes behind it, to be able to implement a successful system.
One key part of implementation is knowing when a model might benefit from bootstrapping methods. Models built this way are called ensemble models. Two examples are AdaBoost and Stochastic Gradient Boosting.
Why use ensemble models?
They can help improve an algorithm's accuracy or make a model more robust. Two common ensemble techniques are boosting and bagging. Both are topics that data scientists and machine learning engineers should know, especially when preparing for a data science or machine learning interview.
Essentially, ensemble learning is true to the word ensemble. Except, instead of several people singing at different octaves to create one beautiful harmony (each voice filling in the void of the other), ensemble learning uses hundreds to thousands of models of the same algorithm that work together to find the correct classification.
Another way to think about ensemble learning is the fable of the blind men and the elephant. In this example, each blind man feels a different part of an elephant, so they disagree on what they’re feeling. However, had they come together and discussed it, they might have been able to figure out they were touching different parts of the same thing.
Using techniques like boosting and bagging has led to increased robustness of statistical models and decreased variance.
Now the question becomes, what is the difference between all these different “B” words?
Let’s first talk about the very important concept of bootstrapping. Many data scientists miss this and go straight to explaining boosting and bagging. But both build on bootstrapping.
In machine learning, the bootstrap method refers to random sampling with replacement. Each sample drawn this way is referred to as a resample. Because every resample contains a slightly different mix of the original data points, a model trained across many resamples is exposed to more of the biases, variances, and features in the data than a single pass over the full data set would show. This is demonstrated in Figure 1, where each sample population has different pieces and none are identical. These differences shift the mean, standard deviation, and other descriptive metrics from one resample to the next. In turn, models developed on them can be more robust.
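To make sampling with replacement concrete, here is a minimal sketch using only Python's standard library. The toy data set and the helper name `bootstrap_resample` are assumptions made for illustration; they are not part of any particular library.

```python
import random


def bootstrap_resample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))


rng = random.Random(0)  # fixed seed, only so the sketch is repeatable
data = list(range(1, 11))  # hypothetical toy data set: the integers 1..10

resamples = [bootstrap_resample(data, rng) for _ in range(3)]
for sample in resamples:
    mean = sum(sample) / len(sample)
    # Some points show up more than once, others are left out entirely,
    # so each resample has its own mean and its own set of distinct values.
    print(sorted(sample), "mean =", round(mean, 2), "distinct =", len(set(sample)))
```

Each resample has the same size as the original data, but because points are drawn with replacement, a typical resample repeats some points and omits others. That is what gives every resample slightly different descriptive statistics.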
Bootstrapping is also great for small data sets, on which models have a tendency to overfit. In fact, we recommended this to one company that was concerned that their data sets were far from “Big Data.” Bootstrapping can be a solution in this case because algorithms that utilize bootstrapping can be more robust and handle new data sets, depending on the methodology (boosting or bagging).
One reason to use the bootstrap method is that it can test the stability of a solution. It can increase robustness by using multiple sample data sets and testing multiple models. Perhaps one sample data set has a larger mean than another, or a different standard deviation. This might break a model that was overfit and not tested using data sets with different variations.
One of the many reasons bootstrapping has become common is the increase in computing power, which allows many more resamples to be drawn and evaluated than was previously practical. Bootstrapping is used in both bagging and boosting, as will be discussed below.
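One way to see what "testing the stability of a solution" can mean in practice is to recompute a statistic on many resamples and look at how much it moves. The sketch below does this for the mean. The synthetic data, the seed, and the function name `bootstrap_means` are assumptions chosen for illustration.

```python
import random
import statistics


def bootstrap_means(data, n_resamples, rng):
    """Return the mean of each bootstrap resample (sampling with replacement)."""
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]


rng = random.Random(42)
# Hypothetical small data set: 30 draws from a normal distribution.
data = [rng.gauss(50.0, 10.0) for _ in range(30)]

means = bootstrap_means(data, n_resamples=2000, rng=rng)
# The spread of the resampled means is a bootstrap estimate of how much
# the mean would vary if the data had been collected again.
spread = statistics.stdev(means)

print("sample mean:", round(statistics.fmean(data), 2))
print("spread of resampled means:", round(spread, 2))
```

A large spread suggests that a result depending on this statistic is sensitive to which points happened to land in the sample. The same idea applies to model metrics such as accuracy, by refitting the model on each resample.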
Bagging is short for bootstrap aggregating. Almost any paper or post that references bagging algorithms will also reference Leo Breiman, who wrote a paper in 1996 called “Bagging Predictors”.
In it, Breiman describes bagging as:
“Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor.”
Bagging helps reduce variance from models that might be very accurate, but only on the data they were trained on. This is also known as overfitting.
Overfitting is when a function fits the training data too well. Typically, this is because the fitted equation is complicated enough to account for every data point and outlier, so it captures noise rather than the underlying pattern.
One algorithm that can overfit easily is a decision tree. The models developed using decision trees rely on very simple heuristics: a decision tree is a set of “if-else” statements applied in a specific order. Thus, if the model is applied to a new data set whose differences from the previous set come from bias or noise rather than from the underlying features, it will fail to be as accurate, because the rules it learned no longer fit the data.
Bagging gets around this by creating its own variance amongst the data: it samples with replacement while it tests multiple hypotheses (models). In turn, this reduces the noise by utilizing multiple samples that would most likely be made up of data with various attributes (median, average, etc.).
Once each model has developed a hypothesis, the models use voting for classification or averaging for regression. This is where the “Aggregating” in “Bootstrap Aggregating” comes into play. Each hypothesis has the same weight as all the others. When we later discuss boosting, this is one of the places where the two methodologies differ.
Essentially, all these models can be trained at the same time, and they then vote on the final prediction.
This helps to decrease variance, i.e., reduce overfitting.
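The bagging procedure described above can be sketched in a few lines. This is an illustrative toy, not a production implementation: the base model is a one-split decision tree (a "stump") on a single feature, and the names `fit_stump`, `fit_bagging`, and `predict_bagging`, the noisy synthetic data, and the choice of 25 models are all assumptions.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a one-split 'if-else' rule on (x, label) pairs by minimizing training errors."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def fit_bagging(points, n_models, rng):
    """Train each stump on its own bootstrap resample of the training data."""
    return [fit_stump(rng.choices(points, k=len(points))) for _ in range(n_models)]


def predict_bagging(models, x):
    """Aggregate by majority vote; every model's vote carries the same weight."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rng = random.Random(1)
train = []
for _ in range(200):
    x = rng.random()
    label = int(x > 0.5)      # underlying rule: class 1 when x > 0.5
    if rng.random() < 0.1:    # flip about 10% of labels to simulate noise
        label = 1 - label
    train.append((x, label))

models = fit_bagging(train, n_models=25, rng=rng)
print(predict_bagging(models, 0.9), predict_bagging(models, 0.1))
```

Each stump sees a different resample, so each picks a slightly different threshold, and the majority vote smooths out those differences. For regression, the vote would be replaced by an average of the models' outputs. An odd number of models is used here so that a two-class vote cannot tie.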
Boosting refers to a group of algorithms that utilize weighted averages to turn weak learners into stronger learners. Unlike bagging, in which each model runs independently and the outputs are aggregated at the end without preference for any model, boosting is all about “teamwork.” Each model that runs dictates which data points the next model will focus on.
Boosting also relies on resampling the data. However, there is another difference here. Unlike bagging, boosting weights each sample of data. This means some samples will be drawn more often than others.
Why put weights on the data samples?
When boosting runs each model, it tracks which data samples are classified correctly and which are not. The samples that are misclassified most often are given heavier weights. These are considered to be data that have more complexity and require more iterations to properly train the model.
During the actual classification stage, there is also a difference in how boosting treats the models. In boosting, each model’s error rate is tracked, because models with lower error rates are given larger weights.
That way, when the “voting” occurs, like in bagging, the models with better outcomes have a stronger pull on the final output.
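The two kinds of weights, one on the data samples and one on the models, can be sketched with AdaBoost. Note the assumptions: this version reweights the samples directly rather than redrawing them, labels are -1 or +1, the base model is a one-split stump, and the function names and toy data are made up for illustration.

```python
import math


def fit_weighted_stump(xs, ys, weights):
    """Return (error, threshold, polarity) with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x > threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def fit_adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n  # every sample starts with the same weight
    ensemble = []            # list of (alpha, threshold, polarity)
    for _ in range(n_rounds):
        error, threshold, polarity = fit_weighted_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        # Model weight: the lower the weighted error, the larger the say in the vote.
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Sample weights: misclassified points go up, correct ones go down.
        weights = [
            w * math.exp(-alpha * y * (polarity if x > threshold else -polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict_adaboost(ensemble, x):
    """Weighted vote: each stump's prediction is scaled by its alpha."""
    score = sum(alpha * (polarity if x > threshold else -polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data: the positive class sits in the middle of the range,
# so no single threshold can separate the classes.
xs = list(range(20))
ys = [1 if 6 <= x <= 13 else -1 for x in xs]

ensemble = fit_adaboost(xs, ys, n_rounds=3)
accuracy = sum(predict_adaboost(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)
print("training accuracy:", accuracy)
```

Each round, the points the previous stump got wrong carry more weight, so the next stump is pushed toward them. At prediction time the stumps vote, as in bagging, but each vote is scaled by `alpha` so more accurate stumps pull harder on the final output.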
Boosting and bagging are both great techniques to decrease variance. Ensemble methods generally outperform a single model. This is why many of the Kaggle winners have utilized ensemble methodologies. One that was not discussed here was stacking. (That requires its own post.)
However, they won’t fix every problem, and they have problems of their own. There are different reasons you would use one over the other. Bagging is great for decreasing variance when a model is overfit. Boosting, however, is much more likely to be the better pick of the two, and it is also great for decreasing bias in an underfit model. The trade-off is that boosting is also much more likely to cause performance issues.
This is where experience and subject matter expertise come in! It can be easy to jump on the first model that works. However, it is important to analyze the algorithm and all of the features it selects. For instance, if a decision tree sets specific leaves, the question becomes why! If you can’t support it with other data points and visuals, it probably shouldn’t be implemented.
It is not just about trying AdaBoost or random forests on various data sets. The final algorithm depends on the results it’s getting and what evidence supports them.