Boosting Algorithms
In Machine Learning
Ensemble Learning and Ensemble Method
- Ensemble Learning is a method that is used to enhance the performance of Machine Learning model by combining several learners.
- When compared a single model , this type of learning builds models with improved efficiency and accuracy.
- Suppose you ask a complex question to thousands of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert’s answer. This is called the wisdom of the crowd. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.
Different parts of Ensemble Methods
Bagging v/s Boosting
Boosting
- Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are Ada Boost (short for Adaptive Boosting) and Gradient Boosting.
- The boosting algorithms are primarily used in machine learning for reducing bias and variance.
- While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier.
Boosting is a process that uses a set of Machine Learning algorithms to combine weak learner to form strong learners in order to increase the accuracy of the model.
Working of Boosting Algorithms
Boosting Algorithms
- Ada Boost (Adaptive Boosting)
- Gradient boosting
- XG Boost (Xtreme Gradient Boosting)
Ada Boost
- Ada boost or adaptive boosting is the first stepping stone in the world of boosting algorithms.
- Ada boost is the first boosting algorithm among all the boosting algorithms. The main idea behind this algorithm is to combine multiple weak learners to into a strong classifier. It can be used for both classification and regression tasks.
- One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor has incorrectly classified. This results in new predictors focusing more and more on the hard cases. This is the technique used by Ada‐Boost.
- For example, to build an Ada Boost classifier, a first base classifier (such as a Decision Tree) is trained and used to make predictions on the training set. The relative weight of misclassified training instances is then increased. A second classifier is trained using the updated weights and again it makes predictions on the training set, weights are updated, and so on
- When the random forest is used, the algorithm makes n number of trees. It makes proper trees that consist of a start node with several leaves nodes. Some trees might be bigger than others, but there is no fixed depth in a random forest. But with Adaboost, that’s not the case. In AdaBoost, the algorithm only makes a node with two leaves, and this is known as Stump.
Note – The figure shown above represents the stump. It can be seen clearly that it has only one node with only two leaves. These stumps are weak learners, and boosting techniques prefer this. The order of stumps is very important in AdaBoost. The error of the first stump influences how the other stump is made.
Working of Ada Boost
Step 1:
At very first step Ada Boost reads the dataset and assigns equal weight to each feature.
Initially, all the samples will have the same weight : W=1/N
Row No. | Feature 1 | Feature 2 | Feature 3 | Output | Sample Weight |
1 | X11 | X12 | X13 | Yes | 1/5 |
2 | X21 | X22 | X23 | Yes | 1/5 |
3 | X31 | X32 | X33 | No | 1/5 |
4 | X41 | X42 | X43 | No | 1/5 |
5 | X51 | X52 | X53 | Yes | 1/5 |
Step 2:
In this step Ada boost will create it’s first base learner i.e. the first stump. It will create the same number of stumps as the number of features.
- Out of these 3 models, the algorithm selects only one. For selecting a base learner, there are two properties, those are, Gini and Entropy. We must calculate Gini or Entropy the same way it is calculated for decision trees. The stump that has the least value will be the first base learner.
- In the figure, all the 3 stumps can be made with 3 features. The number below the leaves represents the correctly and incorrectly classified records. By using these records, the Gini or entropy index is calculated. The stump that has the least entropy or Gini will be selected for the base learner.
Now, we will assume that the entropy index is the least for stump1. So, let’s take stump 1, i.e., feature 1 as our first base learner.
Here feature 1 has misclassified the second record
Row No. | Feature 1 | Feature 2 | Feature 3 | Output | Sample Weight |
1 | X11 | X12 | X13 | Yes | 1/5 |
2 | X21 | X22 | X23 | Yes | 1/5 |
3 | X31 | X32 | X33 | No | 1/5 |
4 | X41 | X42 | X43 | No | 1/5 |
5 | X51 | X52 | X53 | Yes | 1/5 |
Step 3:
Now, Ada boost will calculate the total error by using the formula :
No. of Samples classified as wrong / Total No. of Samples
In our case, there is only 1 error, so Total Error (TE) = 1/5.
Step 4:
After calculating the total error now Ada boost will calculate the performance of stump.
Why its necessary to calculate the TE and performance of stump?
The answer is, we must update the sample weight before proceeding for the next model or stage because if the same weight is applied, we receive the output from the first model. In boosting, only the wrong records/incorrectly classified records got more preference than the correctly classified records. Thus, only the wrong records from the decision tree/stump are passed on to another stump. While in Ada Boost, both records were allowed to pass, the wrong records are repeated more than the correct ones. We must increase the weight for the wrongly classified records and decrease the weight for the correctly classified records. In the next step, we will be updating the weights based on the performance of the stump.
Step 5 :
Now, Ada boost will update the weights of correctly and in correctly classified samples :
Weights of incorrectly classified samples = Weight * ePerf Say
Weights of correctly classified samples = Weight * e-Perf Say
Row No. | Feature 1 | Feature 2 | Feature 3 | Output | Sample Weight | Updated Weight |
1 | X11 | X12 | X13 | Yes | 1/5 | 0.1 |
2 | X21 | X22 | X23 | Yes | 1/5 | 0.399 |
3 | X31 | X32 | X33 | No | 1/5 | 0.1 |
4 | X41 | X42 | X43 | No | 1/5 | 0.1 |
5 | X51 | X52 | X53 | Yes | 1/5 | 0.1 |
After updating the weights now Ada boost tries to normalize the updated weights of the samples :
Row No. | Feature 1 | Feature 2 | Feature 3 | Output | Sample Weight | Updated Weight | Normalized Weight |
1 | X11 | X12 | X13 | Yes | 1/5 | 0.1 | 0.13 |
2 | X21 | X22 | X23 | Yes | 1/5 | 0.399 | 0.50 |
3 | X31 | X32 | X33 | No | 1/5 | 0.1 | 0.13 |
4 | X41 | X42 | X43 | No | 1/5 | 0.1 | 0.13 |
5 | X51 | X52 | X53 | Yes | 1/5 | 0.1 | 0.13 |
Step 6:
Now, in this last step the Ada boost will create a new dataset for the next weak learner
The algorithm will run 5 iterations to select different-different records from the older dataset. In each iteration the algorithm will take one random value and selects the record whose bucket contains that value. Like this there is a high probability for the wrong records to get selected several times.
Normalized Weights | Buckets |
0.13 | 0 – 0.13 |
0.50 | 0.13 – 0.63 |
0.13 | 0.63 – 0.76 |
0.13 | 0.76 – 0.89 |
0.13 | 0.89 – 1.02 |
How does the algorithm decide output for test data?
Suppose with the above dataset, the algorithm constructed 3 decision trees or stumps, the test dataset will pass through all the stumps which have been constructed by the algorithm. While passing through the 1st stump, it gives the output as 1, passing through 2nd stump it again gives the output as 1, and while passing through 3rd stump it gives the output as 0. So, in Ada Boost algorithm also, the majority of votes take place between the stumps, the same as in random trees. And in this case, the final output will be 1. This is how the output with test data is decided.
Advantages of Ada Boost
- Ada Boost has a lot of advantages, mainly it is easier to use with less need for tweaking parameters unlike algorithms like SVM.
- Theoretically, Ada Boost is not prone to overfitting though there is no concrete proof for this. It could be because of the reason that parameters are not jointly optimized — stage-wise estimation slows down the learning process.
- Ada Boost can be used to improve the accuracy of your weak classifiers hence making it flexible.
- It has now being extended beyond binary classification and has found use cases in text and image classification as well.
Disadvantages of Ada Boost
- Boosting technique learns progressively, it is important to ensure that you have quality data.
- Ada Boost is also extremely sensitive to Noisy data and outliers so if you do plan to use Ada Boost then it is highly recommended to eliminate them.
- Ada Boost has also been proven to be slower than XG Boost.
Practical Applications of Ada Boost
- Ada Boost can be used to solve a variety of real-world problems, such as predicting customer churn and classifying the types of topics customers are talking/calling about.
- The algorithm is heavily utilised for solving classification problems, given its relative ease of implementation in languages such as R and Python.
- Detection Pedestrian Using Patterns of Motion and Appearance
- Research Paper published by Paul Viola, Michael J. Jones, Daniel Snow on IEEE ICCV(International Conference on Computer Vision)
- A pedestrian detection system using image intensity information and motion information with the detectors trained by AdaBoost.
Gradient Boosting
- Another very popular Boosting algorithm is Gradient Boosting. Just like Ada Boost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like Ada Boost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.
- The base learners are generated sequentially in such a way that the present base learner is always more effective than the previous one.
- It can be used for both classification and regression problems.
Normally there are three main steps involved for performing Gradient Boost:
- Loss Function that needs to be ameliorated.
- Weak Learner for computing predictions and forming strong learners.
- An Additive Model that will regularize the loss function
Loss Function
- The loss function used depends on the type of problem being solved.
- For example, regression may use a squared error and classification may use logarithmic loss.
- It must be differentiable, but many standard loss functions are supported and you can define your own.
Weak Learner
- Decision trees are used as the weak learner in gradient boosting.
- Trees are constructed in a greedy manner, choosing the best split points based on purity scores like Gini or to minimize the loss.
- Initially, such as in the case of Ada Boost, very short decision trees were used that only had a single split, called a decision stump. Larger trees can be used generally with 4-to-8 levels.
- It is common to constrain the weak learners in specific ways, such as a maximum number of layers, nodes, splits or leaf nodes.
- This is to ensure that the learners remain weak, but can still be constructed in a greedy manner.
Additive Model
- Trees are added one at a time, and existing trees in the model are not changed.
- A gradient descent procedure is used to minimize the loss when adding trees.
- Instead of parameters, we have weak learner sub-models or more specifically decision trees. After calculating the loss, to perform the gradient descent procedure, we must add a tree to the model that reduces the loss (i.e. follow the gradient). We do this by parameterizing the tree, then modify the parameters of the tree and move in the right direction by reducing the residual loss.
- The output for the new tree is then added to the output of the existing sequence of trees in an effort to correct or improve the final output of the model.
- A fixed number of trees are added or training stops once loss reaches an acceptable level or no longer improves on an external validation dataset.
Steps Involved in Gradient Boosting
Step 1:
- Gradient Boost starts by making a single leaf, instead of a tree or stump. This leaf represents a initial guess for the weights of all the samples.
- Usually while trying to predict a continuous value the first guess is the average value.
- In, regression we can easily calculate the average of all the values but in classification we do this by taking log of odds.
- And, for using these log of odds we will be converting these log of odds to probability by using Logistic Function.
Feature 1 | Feature 2 | Observed Value | Predicted Value 1 | Residual 1 |
X11 | X12 | Y1 | Yhat1 | Y1 – Yhat1 |
X21 | X22 | Y2 | Yhat1 | Y2 – Yhat1 |
- Then, Gradient boost tries to calculate the residuals by subtracting the predicted value by the actual value of the dependent variable.
- One thing to remember is that sometimes we call these residuals as pseudo residuals. The term pseudo residual is based on linear regression where the difference between observed and predicted values results in residuals. The “pseudo” part of pseudo residual is a reminder that we are doing Gradient Boost and not linear regression.
Step 3:
- Now, Gradient Boost will construct a decision tree using the input features and these residuals.
- It will try to fit all of these features according to the output of first residuals.
- And, when it passes these input features again to this decision tree it will be getting a new set of residuals by using the same formula i.e. Observed value – Predicted Value.
Note : Gradient Boost has a tendency to overfit the dataset. So, to solve this we normally multiply the residual with the learning rate α (alpha) and the add it to the previous model’s output value.
Feature 1 | Feature 2 | Observed Value | Predicted Value 1 | Residual 1 | Predicted Value 2 | Residual 2 |
X11 | X12 | Y1 | Yhat1 | Y1-Yhat1 | Yhat2 | Y1-Yhat2 |
X21 | X22 | Y2 | Yhat1 | Y1-Yhat1 | Yhat2 | Y1-Yhat2 |
X31 | X32 | Y3 | Yhat1 | Y1-Yhat1 | Yhat2 | Y1-Yhat2 |
Step 4:
- Now, Gradient Boost will be creating another decision tree with the output of the second residuals and they will be giving another set of residuals. Like this it will be building more and more decision tress and the process goes on and on.
- As you see that the residuals will keep on decreasing as more and more decision trees are made. And our main aim is to basically reduce this residual error to some extent.
- At last the addition of the outputs of all the sequential decision tree happens and that will be the final output of Gradient Boosting Algorithm.
F(x) = h0(x) + α1h1(x) + α2h2(x) + . . . . . + αnhn(x)
- In classification problem we cannot directly add the outputs as the predictions are in terms of log of odds and the leaf is derived from probability. So, we have to perform some sort of transformation.
Mathematics Behind Gradient Boost
- Calculating log(odds) for classification : log(Samples with positive output / Samples with negative output)
- Converting log(odds) to probability using Logistic Function for classification : elog(odds) / 1+elog(odds)
- Most common formula used for transformation in gradient boost classification:
∑ Residual / ∑ [Previous Probability * (1 – Previous Probability)]
- Addition of outputs of all sequential trees:
F(x) = h0(x) + α1h1(x) + α2h2(x) + . . . . . + αnhn(x)
Note : α ranges from 0 to 1. And it will be decided with the help of hyperparameter tuning.
Where,
- h0(x) is the output given by base model.
- h1(x) is the output of the first decision tree.
- h2(x) is the output of the second decision tree.
- Similarly, hn(x) is the output of the nth decision tree.
- α1, α2 , ….. , αn are the learning rates.
The more simplified formula : ∑αihi(x)
Where, i = 1 to i=n
- The figure represents the predictions of these three trees in the left column, and the ensemble’s predictions in the right column. In the first row, the ensemble has just one tree, so its predictions are exactly the same as the first tree’s predictions. In the second row, a new tree is trained on the residual errors of the first tree. On the right you can see that the ensemble’s predictions are equal to the sum of the predictions of the first two trees. Similarly, in the third row another tree is trained on the residual errors of the second tree. You can see that the ensemble’s predictions gradually get better as trees are added to the ensemble.
Advantages of Gradient Boost
- Often provides predictive accuracy that cannot be trumped.
- Lots of flexibility – can optimize on different loss functions and provides several hyper parameter tuning options that make the function fit very flexible.
- No data pre-processing required – often works great with categorical and numerical values as is.
- Handles missing data – imputation not required.
- A benefit of the gradient boosting framework is that a new boosting algorithm does not have to be derived for each loss function that may want to be used, instead, it is a generic enough framework that any differentiable loss function can be used
Disadvantages of Gradient Boost
- Gradient Boosting Models will continue improving to minimize all errors. This can overemphasize outliers and cause overfitting.
- Computationally expensive – often require many trees (>1000) which can be time and memory exhaustive.
- The high flexibility results in many parameters that interact and influence heavily the behavior of the approach (number of iterations, tree depth, regularization parameters, etc.). This requires a large grid search during tuning.
- Less interpretative in nature, although this is easily addressed with various tools.
Practical Applications of Gradient Boost
- Neurobotics – Gradient boosting is a useful practical tool for predictive tasks, and consistently provides higher accuracy results compared to conventional single strong machine learning models.
- For example, gradient boosting helps to create models that can map the EMG and EEG sensor readings to human movement tracking and activity recognition.
- Gradient boosting can be used in the field of learning to rank. The commercial web search engines Yahoo and Yandex use variants of gradient boosting in their machine-learned ranking engines.
- Gradient boosting is also utilized in High Energy Physics in data analysis. At the Large Hadron Collider (LHC), variants of gradient boosting Deep Neural Networks (DNN) were successful in reproducing the results of non-machine learning methods of analysis on datasets used to discover the Higgs boson.
Xtreme Gradient Boosting
- Gradient boosting is an ensemble method that sequentially adds our trained predictors and assigns them a weight. However, instead of assigning different weights to the classifiers after every iteration, this method fits the new model to new residuals of the previous prediction and then minimizes the loss when adding the latest prediction. So, in the end, you are updating your model using gradient descent and hence the name, gradient boosting. This is supported for both regression and classification problems. XGBoost specifically, implements this algorithm for decision tree boosting with an additional custom regularization term in the objective function.
- In recent years the main driving force behind the algorithms that won massive ML competitions is Xtreme Gradient Boost or Xgboost.
Evolution of XG Boost Algorithm from Decision Trees
Steps Involved in XG Boost
Step 1:
- The very first step in fitting XG boost to the training data is to make initial prediction.
- This can be anything regardless you are using classification or regression.
- In regression generally it takes the average of the values and in classification by taking log of odds
- And then it will calculate the residuals just like we did in Gradient boosting by subtracting the predicted value from the observed value.
Step 2:
- Once, it has calculated the residuals it will make the trees based on features and residuals.
- Always remember in XG boost when we construct a decision tree it should always be a binary classifier i.e. the leaf node is always only two.
Step 3:
- After building the tree now XG boost will be calculating the similarity weight for each leaf and the root node.
- Both classification and regression problems have different formulas for calculating the similarity weight.
For Classification, ∑(Residual)2 / ∑(Prob(1-Prob)+ƛ)
For Regression, ∑(Residual)2 / No. of Residuals + ƛ
- It calculates the similarity weight to decide whether the feature it is taking as the root node is the best record to split or not.
- After calculating the similarity weight the XG boost will now calculate the gain of the tree and select the tree with the maximum gain and it splits from that record.
Calculating gain : SW of right leaf + SW of the left leaf – SW of the root
Step 4:
- After selecting the best split the XG boost will go for next feature.
- And repeat the same process of calculating the similarity weight and then the gain of the tree and selecting the tree with the maximum gain.
- And this process of making the trees will go on by selecting different features every time.
- At last the addition of the outputs of all the sequential decision tree happens and that will be the final output of Xtreme Gradient Boosting Algorithm.
F(x) = h0(x) + α1h1(x) + α2h2(x) + . . . . . + αnhn(x)
Mathematics Behind XG Boost
- Calculating residuals : Observed Value – Predicted value
- Calculating similarity weight :
For Classification, ∑(Residual)2 / ∑(Prob(1-Prob)+ƛ)
For Regression, ∑(Residual)2 / No. of Residuals + ƛ
- Calculating gain : SW of right leaf + SW of the left leaf – SW of the root
- Addition of outputs of all sequential trees:
F(x) = h0(x) + α1h1(x) + α2h2(x) + . . . . . + αnhn(x)
Post Prunning
- Whether XG boost will go for further splitting or not is decided by the concept known as Post Prunning.
- In classification this is calculated with the help of cover value.
What is cover value?
Remember the formula for calculating similarity weight, whatever we had taken in the denominator i.e. Pr(1-Prob) is the cover value.
- If gain which we are getting is less than cover value than we will cut the branch.
- In regression we have one parameter known as Ɣ(gamma).
- If the value gain – gamma is a negative value we can post prunne the tree i.e. we can cut that particular tree and if it is a positive value we should not cut the tree.
- The gamma is a hyperparameter and we should use this only when it is overfitting. In most of the scenarios gamma is a default value.
Advantages of XG Boost
- Regularization: XG Boost has in-built L1 (Lasso Regression) and L2 (Ridge Regression) regularization which prevents the model from overfitting. That is why, XG Boost is also called regularized form of GBM (Gradient Boosting Machine)
- Parallel Processing: XG Boost utilizes the power of parallel processing and that is why it is much faster than GBM. It uses multiple CPU cores to execute the model.
- Handling Missing Values: XGBoost has an in-built capability to handle missing values. When XGBoost encounters a missing value at a node, it tries both the left and right hand split and learns the way leading to higher loss for each node. It then does the same when working on the testing data.
- Cross Validation: XGBoost allows user to run a cross-validation at each iteration of the boosting process and thus it is easy to get the exact optimum number of boosting iterations in a single run. This is unlike GBM where we have to run a grid-search and only a limited values can be tested.
- Effective Tree Pruning: A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm. XG Boost on the other hand make splits upto the max_depth specified and then start pruning the tree backwards and remove splits beyond which there is no positive gain.
Disadvantages of XG Boost
- Like any other boosting method, XGB is sensitive to outliers.
- Unlike LightGBM, in XGB, one has to manually create dummy variable/ label encoding for categorical features before feeding them into the models.
- The biggest limitation is probably the black box nature. If you need effect sizes, XG Boost won’t give them to you (though some ada boost-type algorithms can give that to you). You’d have to derive and program that part yourself.
- It overfits the model if the model is not stopped to early.
Practical Applications of XG Boost
- Application of XG Boost Algorithm in Fingerprinting Localisation Task
- The research paper was published in Computer Information Systems and Industrial Management at 16th IFIP TC8 International Conference by Marcin Luckner, Bartosz Topolski, Magdalena Mazurek
- An Indoor Positioning System (IPS) issues regression and classification challenges in form of an horizontal localisation and a floor detection. They proposed to apply the XG Boost algorithm for both tasks. The algorithm uses vectors of Received Signal Strengths from Wi–Fi access points to map the obtained fingerprints into horizontal coordinates and a current floor number.
- Predicting Sales volume using XG Boost
- The article was written by Omer Faruk Aslantas in which he showed one of the most efficient ways of forecasting your sales data with the XGBoost library of Python