How to boost a decision tree algorithm

The decision tree is a popular and effective algorithm used primarily in classification work, but it also serves well in predicting quantitative phenomena. Much of the charm of tree-based methods lies in the fact that they present us with a set of convenient decision rules, often called business rules.

A quick reminder: a decision tree is a structure used to split a large number of observations (records) into smaller, consecutive sets by applying a sequence of simple decision rules of the form ‘if…, then…’. With each split, the resulting groups become more homogeneous with respect to the phenomenon of interest, i.e. the dependent variable. Crucially, all of the rules can be represented graphically as a tree, which makes the results easy to interpret and the workings of the algorithm easy to understand.
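To make the ‘if…, then…’ character of these rules concrete, here is a minimal sketch using scikit-learn and the bundled Iris data set (both chosen purely for illustration, not part of the original discussion) that prints a shallow tree as nested decision rules.

```python
# Illustrative sketch: print the 'if..., then...' rules of a small decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Keep the tree shallow so the printed rules stay readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# export_text renders the splits as nested 'if feature <= threshold' rules.
print(export_text(tree, feature_names=load_iris().feature_names))
```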

The history of decision trees as algorithms dates back to the 1970s,[1] but a similar approach to classification can be traced back to Carl Linnaeus, who divided the natural world into kingdoms, classes, orders, genera and species in the 1730s.[2] Many ways of constructing these models, based on various splitting measures, have been proposed since then. Today, this classic algorithm often serves as the building block for more sophisticated machine learning algorithms.

Despite their efficiency, decision trees have their downsides, as most algorithms do. One of the key problems with predictions based on decision trees is the instability of the results: the model is sensitive to the specific features of a data set, and even a small change in the data can lead to a completely different tree. Hence the idea to improve this classic data mining model.

Breiman attempted to resolve the issue of decision tree instability in 1996. He proposed a solution that employed the predictive power of not one but many tree models simultaneously and called it bootstrap aggregation, or “bagging”. Bagging involves creating a number of training subsets from the initial data set, each drawn from the original data by sampling with replacement. With this sampling, some observations are not selected for a given training sample; they form that sample’s validation set (the “out-of-bag” observations). In other words, the idea is to carry out multiple rounds of sampling and multiple divisions of the data into training and test sets, so there is no need to split the initial set into training and validation sets beforehand; this is an integral part of the algorithm.

The next step of the algorithm is to build a tree model on each sample. The final prediction, selected by voting, is affected by all of the trees trained by the model. For classification problems, voting means that the final class is the one indicated by the largest number of individual trees. Because the prediction is based on the indications of numerous trees built on different samples, the patterns learned are more universal and less easily deceived by the quirks of a specific data set.

[Figure: bagging]
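As a rough illustration, scikit-learn’s BaggingClassifier implements this scheme; the data set and parameter values below are arbitrary choices for the sketch, and the out-of-bag score is simply the accuracy estimated on the observations each tree did not see during training.

```python
# Illustrative sketch of bootstrap aggregation (bagging) over decision trees.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base learner ('base_estimator' in older scikit-learn)
    n_estimators=100,                    # number of bootstrap samples, i.e. trees
    oob_score=True,                      # evaluate on the out-of-bag observations
    random_state=0,
)
bagging.fit(X, y)

# Accuracy estimated on the observations each tree did not see during training.
print("Out-of-bag accuracy:", bagging.oob_score_)
```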

What was wrong with bagging that drove Breiman to create a new algorithm? The answer lies in the very mechanism of decision trees. If bagging uses exactly the same set of predictors to build each tree, some variables may be so strongly correlated with the investigated phenomenon that every tree chooses the same variable for its first node, even though other variables may be only slightly less correlated with the dependent variable. This affects the further splits as well, and the potential use of a large number of variables can make the approach very complex and time-consuming. As a result of these limitations, and despite growing many trees, the individual classifiers end up with similar structures, so no satisfactorily stable model is obtained and the results do not generalise well to data outside the set at hand.

Breiman reacted to this problem in 2001 with a technique called random forests. The random forest algorithm is one of the most popular and universal machine learning methods. It is capable of solving both regression and classification tasks, as was the case with the single decision tree and bagging.

The principle behind the random forest method is very similar to how bootstrap aggregation works. The difference is that the individual decision trees consider only a random subset of the available predictors rather than all of them.

The random forest algorithm first samples, with replacement, an appropriate number of subsets, setting aside out-of-bag observations for model validation as in bagging. The next step is to randomly select a subset of predictors (independent variables) for each sample. The final result is decided by voting, just as in bagging: the final output of the whole random forest model is the value predicted by the largest number of individual tree models.

[Figure. Source: GlobalSoftwareSupport.com]
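A minimal random forest sketch, again using scikit-learn and an example data set chosen only for illustration, might look as follows; the max_features setting is what limits how many predictors are examined when searching for each split.

```python
# Illustrative sketch of a random forest with a restricted predictor subset per split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees, each grown on a bootstrap sample
    max_features="sqrt",   # random subset of predictors considered at each split
    oob_score=True,        # built-in validation on out-of-bag observations
    random_state=0,
)
forest.fit(X, y)

print("Out-of-bag accuracy:", forest.oob_score_)
```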

The algorithm may look exciting, but there ain’t no such thing as a free lunch. The price for the better model stability is longer computation time, more demanding interpretation of results, and more complex predictive rules than for the traditional (single) tree model.

Now that the issue of predictive model stability has been dealt with, what about prediction performance?

One method of improving predictive model performance is to merge classifiers, i.e. individual models, into “ensembles”. An ensemble is a model composed of less complex classifiers that jointly make the final decision, which can be beneficial on many levels. The general idea behind this type of algorithm is to build a complex model with strong predictive power out of many weak classifiers. In this context, ‘weak’ classifiers are those not aimed at achieving the maximum performance of a single tree model; hence the trees are not very complex.

This approach can be found in AdaBoost[3], short for Adaptive Boosting. This algorithm is appreciated in analytical circles and has served as a template for many later boosting techniques. The fundamental difference between AdaBoost and random forests is that random forests are parallel ensembles in which the individual trees are independent of each other; AdaBoost is quite the opposite. Its approach is sequential, and each subsequent classifier depends closely on the previous ones.

[Figure: AdaBoost (Adaptive Boosting)]

The first iteration of the model building process creates the base classifier. The second and subsequent iterations focus on the objects that were misclassified by the previous model: its imperfections are corrected by adjusting the observation weights in the next iteration. Initially, the weights of all records are equal. In subsequent iterations, as consecutive tree models are grown, the weights of misclassified observations are increased while those of correctly predicted data points are reduced. This way, each consecutive tree is grown on an appropriately re-weighted training set, which is an alternative to the approach used in random forests. In AdaBoost, each successive tree attempts to classify correctly what was misclassified by its predecessors; put differently, it tries to eliminate the errors of the previous model.
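A minimal sketch with scikit-learn’s AdaBoostClassifier is shown below; the one-split “stumps”, the number of rounds, the learning rate and the data set are all illustrative assumptions rather than recommended settings.

```python
# Illustrative sketch of AdaBoost: shallow stumps combined sequentially,
# with observation weights adjusted after each round.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner: a one-split stump
    n_estimators=100,                               # number of boosting rounds
    learning_rate=0.5,                              # how strongly each round contributes
    random_state=0,
)
ada.fit(X, y)

print("Training accuracy:", ada.score(X, y))
```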

The sequential approach to model building discussed above has become very popular and is still being developed, resulting in ever new machine learning methods. Using this approach, and drawing on ideas from neural network training, another way to minimise the errors of the individual models in an ensemble was developed: gradient boosting, formalised by Jerome Friedman. The fitting method here is gradient descent, which minimises the residuals, i.e. the differences between the true values and the model predictions. When predicting real estate prices, for example, the residual is the difference between the actual price of an apartment and the price predicted by the model.
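The residual-fitting idea can be sketched by hand. The toy loop below uses synthetic data, and the tree depth, learning rate and number of rounds are arbitrary assumptions; each small regression tree is trained on the residuals left by the ensemble so far, which, for squared error loss, are exactly the negative gradient.

```python
# Hand-rolled sketch of the residual-fitting idea behind gradient boosting.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant model
trees = []

for _ in range(100):
    residuals = y - prediction                       # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                           # next weak learner targets the residuals
    prediction += learning_rate * tree.predict(X)    # shrink its correction and add it
    trees.append(tree)

print("Mean absolute residual after boosting:", np.abs(y - prediction).mean())
```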

Classification errors should naturally be as small as possible. According to the incremental strategy of gradient methods, the quality of the model improves over consecutive rounds. Errors are reduced with each iteration, which could lead to a seemingly perfect situation in which the final model makes no mistakes at all. However, keep in mind the consequences of such a good fit: in reality, such a model is overfitted, as it fits only its training data well. It is then no longer capable of generalising its results and may be of little use for new data. This approach to improving learning performance was developed further into XGBoost.

XGBoost (Extreme Gradient Boosting) is one of the most popular data mining methods. It was proposed by Tianqi Chen in 2016. XGBoost uses an ensemble of classifiers, which can be decision tree models, and here too the final decision on the classification of individual observations is affected by all of the trees in the algorithm. Like Friedman’s algorithm, XGBoost employs an incremental strategy, as it is less complicated and time-consuming than training all of the trees in parallel. An innovative feature of XGBoost is regularisation, a kind of penalty for a model with an excessive number of final observation segments, the “leaves” of the decision tree. In short, XGBoost controls the complexity of the model. Hence, the objective of XGBoost consists of two parts: the first element, the loss or cost function, minimises the error; the second element, regularisation, prevents overfitting and controls the model’s complexity.
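As a rough sketch, assuming the xgboost Python package is installed, the snippet below shows where the regularisation controls sit in the library’s interface; the data set and parameter values are illustrative choices, not tuned recommendations.

```python
# Illustrative sketch of XGBoost with explicit regularisation parameters.
from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

model = XGBClassifier(
    n_estimators=200,   # number of boosted trees
    max_depth=3,        # keep individual trees small
    learning_rate=0.1,  # shrinkage applied to each tree's contribution
    reg_lambda=1.0,     # L2 penalty on leaf weights (regularisation term)
    gamma=0.0,          # minimum loss reduction required to add another split
    random_state=0,
)
model.fit(X, y)

print("Training accuracy:", model.score(X, y))
```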

To summarise, there are many algorithms that try to tackle the imperfections of their ancestors; however, no single algorithm is perfect, and each has its drawbacks. Random forests and XGBoost are among the ensemble-learning algorithms that can handle both regression and classification problems. The methods discussed above differ in their goals and the problems they focus on: random forests are intended to improve prediction stability, while XGBoost focuses on predictive power. Surely, this is not the last word on improved modelling approaches. Machine learning is growing dynamically thanks to the imperfections of existing algorithms, new data analysis problems, and new technical capabilities. Which one is the best? The usual answer proves its versatility once more: ‘it depends’.

 

[1] J.R. Quinlan. Machine Learning. University of Sydney

[2] M. J. A. Berry, G. S. Linoff. Data Mining Techniques for Marketing, Sales and Customer Relationship Management.

[3] Y. Freund, R. E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. 1997.

