Regression trees are a data analysis technique commonly used in poststratification, forecasting, and segmentation. They are also useful for data exploration, identifying the structure of relationships among variables, and finding the best predictors. A set of clear rules, visualised in tree form, splits the dataset into smaller segments with different mean values of the predicted variable. This makes regression trees well suited to estimating customer value, forecasting consumer shopping value, or predicting the duration of website visits. They can serve as an extension of linear regression or analysis of variance.
The previous post on regression trees focused on the principles of interpreting analysis results using consumer shopping value as an example. It is time to have a closer look at the stages of analysis and how the CRT algorithm works.
The first thing you will notice about CRT trees is the binary split: where possible, the parent node is always split into exactly two new segments. This tends to make the tree much deeper than a CHAID classification tree.
What does the process look like? The regression tree algorithm analyses the relationship of each independent variable with the dependent variable and looks for the optimum binary split. In the case of qualitative variables, categories are merged taking the level of measurement into consideration: for nominal variables, every possible grouping of categories into two sets is tested, while for ordinal variables, only neighbouring categories are merged. In the case of quantitative variables, the CRT algorithm searches for the cut point that best divides the distribution in two. Hence, every predictor is split in all possible ways, and the single best split is selected.
Remember that the regression tree algorithm uses the mean and the sum of squared deviations from the mean, as computed in analysis of variance, as the measure of dispersion. New nodes are built so that the subgroups are as internally homogeneous as possible and differ from each other as much as possible in terms of the predicted variable.
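The search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CRT implementation itself: the function names (`sum_sq`, `best_numeric_split`, `nominal_partitions`) and the toy data are made up, and the improvement is computed as the reduction in the sum of squares divided by the number of cases.

```python
from itertools import combinations


def sum_sq(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_numeric_split(x, y):
    """Try a cut point between every pair of neighbouring distinct x values.

    Returns (threshold, improvement) for the best binary split,
    or (None, 0.0) if no split reduces the dispersion.
    """
    n = len(y)
    total = sum_sq(y)
    best = (None, 0.0)
    cuts = sorted(set(x))
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        improvement = (total - sum_sq(left) - sum_sq(right)) / n
        if improvement > best[1]:
            best = (threshold, improvement)
    return best


def nominal_partitions(categories):
    """Yield every division of nominal categories into two non-empty groups."""
    cats = sorted(categories)
    first, rest = cats[0], cats[1:]
    # Fixing the first category on the left avoids listing mirror images twice.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            yield left, set(cats) - left


# Toy data (hypothetical): a quantitative predictor and a purchase value.
x = [1, 2, 3, 10, 11, 12]
y = [5, 6, 7, 20, 21, 22]
print(best_numeric_split(x, y))
print(list(nominal_partitions(["family", "friends", "self"])))
```

For a nominal predictor with k categories the generator yields 2^(k-1) - 1 candidate splits, which is why the number of tested groupings grows quickly with the number of categories. An ordinal predictor would only need the k - 1 splits between neighbouring categories.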
With the independent variable in the picture, the total dispersion of the dependent variable can be split into two parts: the intergroup dispersion and the intragroup dispersion.
The intergroup dispersion represents the variability accounted for by the assignment to a category of the independent variable. The intragroup dispersion represents the error, that is, the part of the variability the predictor failed to account for. The greater the difference between the intragroup dispersion and the total dispersion (which ignores the additional variable), the better the new predictor is at explaining the dispersion of the dependent variable.
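The decomposition described above is the standard analysis-of-variance identity. Written out, with notation that is assumed here rather than taken from the original post (y_ij is the value for case i in group j, n_j the size of group j, ȳ_j the group mean, and ȳ the overall mean), it reads:

```latex
\[
\underbrace{\sum_{j}\sum_{i}\left(y_{ij}-\bar{y}\right)^{2}}_{SS_{\text{total}}}
=
\underbrace{\sum_{j} n_{j}\left(\bar{y}_{j}-\bar{y}\right)^{2}}_{SS_{\text{intergroup}}}
+
\underbrace{\sum_{j}\sum_{i}\left(y_{ij}-\bar{y}_{j}\right)^{2}}_{SS_{\text{intragroup}}}
\]
```

The intergroup term is therefore exactly the amount by which the sum of squares falls when group means replace the overall mean as the forecast.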
In the case of CRT trees, the reduction of dispersion around the mean is called an improvement and branches are split by the predictor yielding the best reduction results compared to the parent node (the best improvement). The improvement value is reported on the decision tree diagram below the parent node. Have a look at the first split. The analysed dataset was split with respect to the target of the purchase. The first node contains people who bought things for family, and the second includes those who shopped for themselves or friends.
Let’s calculate the improvement ‘manually’. We will use the ANOVA table from the Means procedure (ANALYSE -> COMPARE MEANS -> MEANS). It contains the sums of squares of the dependent variable (purchase value) divided into intergroup and intragroup dispersion, which are needed for the significance test. Have a look at the Sum of squares column. The explained variability is shown in the Intergroup row, and the total variability (not taking Target into consideration) is in the Total row.
To obtain the improvement, divide the intergroup sum of squares by the size of the dataset, 351 cases (NOTE: the ANOVA table reports degrees of freedom, not the number of observations). The intergroup sum of squares represents both the explained variability and the reduction of dispersion of the variable, because it is equal to the difference between the total sum of squares and the intragroup sum of squares (the unexplained part). It is the actual improvement in forecast quality obtained by using the group means instead of the overall mean alone.
You can calculate the improvement for each split the same way. Select observations based on the partition made by the CRT algorithm and then calculate the intergroup sum of squares divided by the total number of cases.
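As a rough illustration of the calculation above, the following sketch computes the improvement for one binary split as the intergroup sum of squares divided by the number of cases, and checks that this equals the total sum of squares minus the intragroup sum of squares. The function names and the two small groups of values are made up for the example; they are not the figures from the ANOVA table.

```python
def sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def improvement(groups):
    """Intergroup sum of squares divided by the total number of cases."""
    all_values = [v for g in groups for v in g]
    n = len(all_values)
    grand_mean = sum(all_values) / n
    ss_inter = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    return ss_inter / n


# Hypothetical purchase values in the two child nodes of a split.
node_1 = [1, 2, 3]
node_2 = [7, 8, 9]

ss_total = sum_sq(node_1 + node_2)
ss_intra = sum_sq(node_1) + sum_sq(node_2)

print(improvement([node_1, node_2]))                 # intergroup SS / N
print((ss_total - ss_intra) / len(node_1 + node_2))  # same value
```

Note that the divisor is the number of cases, not the degrees of freedom shown in the ANOVA table.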
The quality of the proposed solution is evaluated using the Risk table. For classification trees, it contains the percentage of incorrectly classified observations in the Evaluation column. Unfortunately, the information in the table is difficult to interpret directly in the case of regression trees as it describes the dispersion of values within the final identified segments.
It becomes clearer if you take another look at the ANOVA table. This time, the independent variable is the number of the node the observation was classified to. The value of the transaction is the dependent variable once again.
The variability accounted for by the decision tree is shown in the Intergroup row, and the unaccounted variability in the Intragroup row. If you divide the intragroup sum of squares by the number of observations, you get the risk value shown in the previous table. It is, therefore, a measure of error (the mean squared error) made when the estimated means within leaves are used to forecast the purchase value.
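The risk calculation can be sketched the same way. In this minimal example the dictionary keys stand for terminal node numbers and the values are invented; only the formula (intragroup sum of squares divided by the number of observations) comes from the text.

```python
def risk(leaves):
    """Intragroup sum of squares across terminal nodes, divided by N."""
    n = sum(len(values) for values in leaves.values())
    ss_intra = 0.0
    for values in leaves.values():
        mean = sum(values) / len(values)
        ss_intra += sum((v - mean) ** 2 for v in values)
    return ss_intra / n


# Hypothetical terminal nodes: node number -> purchase values of its cases.
leaves = {3: [1, 2, 3], 4: [7, 8, 9]}
print(risk(leaves))
```

Because each case is forecast with the mean of its leaf, this is the mean squared error of the tree on the data it was grown from.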
Now, we will have a look at the variability accounted for by the tree. If you divide the intergroup sum of squares by the total sum of squares, you get the percentage of variability of the forecast variable accounted for by the CRT tree. It is 60.1%. If you add the values forecast by the regression tree to the dataset (the means for the individual terminal nodes) and run a linear regression of the dependent variable on the forecast values, the R-squared will be exactly 0.601.
How does a decision tree use predictors? Parent nodes are split using the predictor with the highest value of improvement. Note that each node is split independently, as if the analysis were run separately for the cases in each subgroup, selected with an increasingly complex filter. Hence, the tree may use a different predictor at each tier, which is usually the case. The CRT algorithm may also reuse a predictor if it yields the optimal split at a later stage.
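The equality between the explained share and R-squared can be checked on a toy example. The sketch below uses invented values and node assignments, and the helper names are assumptions; it compares the ratio of the intergroup to the total sum of squares with the squared Pearson correlation between the observed values and the leaf means, which is the R-squared of a simple linear regression.

```python
def leaf_mean_forecast(y, node):
    """Replace each observation with the mean of its terminal node."""
    sums, counts = {}, {}
    for yi, k in zip(y, node):
        sums[k] = sums.get(k, 0.0) + yi
        counts[k] = counts.get(k, 0) + 1
    return [sums[k] / counts[k] for k in node]


def explained_share(y, forecast):
    """Intergroup sum of squares divided by the total sum of squares."""
    grand_mean = sum(y) / len(y)
    ss_total = sum((yi - grand_mean) ** 2 for yi in y)
    ss_inter = sum((f - grand_mean) ** 2 for f in forecast)
    return ss_inter / ss_total


def r_squared(x, y):
    """Squared Pearson correlation, i.e. R-squared of a simple regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical data: purchase values and the terminal node of each case.
y = [1, 2, 3, 7, 8, 9]
node = [1, 1, 1, 2, 2, 2]
forecast = leaf_mean_forecast(y, node)

print(explained_share(y, forecast))
print(r_squared(forecast, y))
```

Both numbers agree because regressing the observed values on the leaf means gives a slope of one and an intercept of zero, so the regression forecast is the leaf mean itself.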
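The node-by-node logic described above can be illustrated with a small recursive sketch. It is a simplification under several assumptions: only numeric predictors are handled, there is no pruning or substitute handling, the stopping rule is just a maximum depth, and the names `best_split` and `grow` as well as the data are invented. Each node is searched independently, and the improvement is expressed relative to the size of the whole dataset.

```python
def sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, target, predictors):
    """Return (reduction, predictor, threshold) for the best split, or None."""
    parent_ss = sum_sq([r[target] for r in rows])
    best = None
    for p in predictors:
        cuts = sorted({r[p] for r in rows})
        for lo, hi in zip(cuts, cuts[1:]):
            t = (lo + hi) / 2
            left = [r[target] for r in rows if r[p] <= t]
            right = [r[target] for r in rows if r[p] > t]
            reduction = parent_ss - sum_sq(left) - sum_sq(right)
            if best is None or reduction > best[0]:
                best = (reduction, p, t)
    return best


def grow(rows, target, predictors, n_total=None, depth=0, max_depth=3):
    """Recursively split each node on its own best predictor."""
    n_total = n_total or len(rows)
    y = [r[target] for r in rows]
    node = {"n": len(rows), "mean": sum(y) / len(y)}
    if depth >= max_depth:
        return node
    split = best_split(rows, target, predictors)
    if split is None or split[0] <= 0:
        return node  # nothing left to gain: this is a terminal node
    reduction, predictor, threshold = split
    node["split"] = (predictor, threshold)
    node["improvement"] = reduction / n_total
    left = [r for r in rows if r[predictor] <= threshold]
    right = [r for r in rows if r[predictor] > threshold]
    node["left"] = grow(left, target, predictors, n_total, depth + 1, max_depth)
    node["right"] = grow(right, target, predictors, n_total, depth + 1, max_depth)
    return node


# Hypothetical data: two 0/1 predictors and a purchase value.
data = [
    {"a": 0, "b": 0, "value": 1}, {"a": 0, "b": 1, "value": 1},
    {"a": 0, "b": 0, "value": 2}, {"a": 0, "b": 1, "value": 2},
    {"a": 1, "b": 0, "value": 10}, {"a": 1, "b": 0, "value": 11},
    {"a": 1, "b": 1, "value": 20}, {"a": 1, "b": 1, "value": 21},
]
tree = grow(data, "value", ["a", "b"])
print(tree["split"], tree["right"]["split"])
```

In this toy dataset the root is split on `a`, the right-hand child is then split on `b`, and the left-hand child stays a leaf because no split reduces its dispersion, which mirrors the way a tree can use different predictors at different tiers.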
Another important point is that CRT decision trees handle missing data in a specific way. Unlike the CHAID algorithm, which treats missing data as a separate category, CRT falls back on a different predictor, called a substitute, when a case has a missing value on the variable used for the split. The substitute is a predictor strongly associated with the primary splitting variable, so its split reproduces the primary split as closely as possible, and it has its own improvement value for the node.
This optimal use of independent variables, however, makes it hard for the user to assess their impact on the assignment to final segments. The relevant information can be found in the Importance of independent variables table and the accompanying chart.
Figure 5. The impact of individual independent variables on the classification
The table and the chart show the assessment of the impact of individual predictors on the model. The most important predictor has the highest importance value; the others are expressed as percentages of it. You can see that the most important predictor of transaction value was the target of the purchase, with gender in second place at merely 26% of the importance of the target.
Importance values of independent variables are calculated using the improvement. The table and chart draw a general picture, while the detailed relationships between predictors and the dependent variable are shown in the Substitutes table below.
The first column contains the number of the split node. Further columns show the independent variables involved in the split. The variable described as Primary is the one actually used for the split. The table lists Substitutes as well; these are the variables used in the case of missing data. The table reports the Improvement value for each variable, together with a measure of the strength of the relationship between each substitute and the primary variable. The final importance of a predictor is the sum of its Improvements both as the primary variable and as a substitute. The importance of gender is 56.094 + 30.229 + 160.547 = 246.870, for example.
The Substitutes table reveals another essential feature of the way CRT selects predictors. Have a look at the Purchase method. It is not very important for the model (an importance of only 4% of that of Purchase target) and appears only as a substitute, as it was not used as the primary variable in any split. On the one hand, this feature allows decision trees to select the most important predictors. On the other hand, it may be a disadvantage compared to dimensionality reduction techniques: a variable may be important for the model as a whole and yet never be used to grow the tree because it always comes second, as a substitute. Interpretation should therefore involve not only the tree diagram but the Substitutes table as well.
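The importance calculation is a plain sum, which a short sketch makes explicit. The three gender improvements below are the ones quoted above; the function names are assumptions, and the second example uses toy numbers unrelated to the output discussed in this post, purely to show how importance can be expressed as a percentage of the top predictor.

```python
def importance(improvements):
    """Sum of a predictor's improvements as primary variable and as substitute."""
    return sum(improvements)


def normalised_importance(importances):
    """Express each importance as a percentage of the largest one."""
    top = max(importances.values())
    return {name: 100 * value / top for name, value in importances.items()}


# Improvements for gender quoted in the text.
gender = importance([56.094, 30.229, 160.547])
print(round(gender, 3))

# Toy values only (not from the article), to show the percentage scale.
print(normalised_importance({"x": 200.0, "z": 50.0}))
```

A predictor that never appears as a primary variable can still collect a non-zero importance this way, because its improvements as a substitute are included in the sum.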
As you can see, the Improvement value for CRT trees can easily be related to the reduction of dispersion of the dependent variable in analysis techniques such as ANOVA or linear regression. I hope I have managed to encourage you to use regression trees. Analysts who use classic modelling techniques should find the notions and the assessment of model fit in decision trees familiar, despite the differences in how the results are interpreted.