Correspondence analysis is a technique often thought of as an alternative to a crosstab (contingency table). This is due to the fact that both techniques can be applied when analyzing dependencies between two variables with nominal or ordinal measurement level. In comparison to a crosstab, the main advantage of correspondence analysis is that it allows for a graphic visualization of those dependencies in a so-called perceptual map. That is why it is a technique useful both to analysts and report readers. It is particularly appreciated by persons who seek ways to accelerate exploratory analysis of qualitative data.
The most important goal of correspondence analysis is to determine the coordinates for each qualitative variable category that will be presented as a data point in a scatter chart. It means that, in a certain way, we create a quantitative variable out of a qualitative variable (we give quantifications to specific categories of that variable). This idea is characteristic not only of correspondence analysis, but also of other techniques from the optimal scaling family. Of course, the calculation of the coordinates described above must be done such that the distances between data points in a perceptual map reflect the existing similarities and dissimilarities between categories.
In the two previous articles: From crosstab to correspondence analysis: how exactly are the rows and columns of a crosstab transformed into data points in a chart?, and Perceptual maps: Choosing the correct normalization method, we concentrated mainly on creating and interpreting perceptual maps. We know that the map presents, in two dimensions, phenomena that are multi-dimensional in nature. We accepted the fact that the map disregards other dimensions. The assurance that the two presented dimensions are most important, and the other dimensions may be left out, was enough. In this article, however, I would like to show that, when we perform correspondence analysis, we have at our disposal tools that allow us to conduct diagnostics on the solution applied. Although correspondence analysis is an exploratory technique, it does not mean that the question of quality of the solution is not important. I shall endeavor to convince you that before you publish a perceptual map in a report, you should take a look at the other tables produced by the procedure, besides the visualization.
The fundamental question that should be asked is: what percentage of the variability of the analyzed variables do we lose when we dismiss certain dimensions?
We shall use the example of a chocolate manufacturer who wants to determine the optimal time of broadcast for TV ads for specific chocolate varieties. In the article Perceptual maps: Choosing the correct normalization method, we carried out correspondence analysis and interpreted the perceptual map created. Now, we shall focus on the tables created in addition to the perceptual map. Firstly, let us look at the ‘Summary’ table. Some of its elements are already known. Specific rows of the table contain information concerning specific dimensions. The first column presents singular values obtained as a result of decomposition of the input matrix. Moreover, the table presents results of the chi-square test, also already known to us.
Now let us concentrate on the ‘Inertia’ column. Inertia is the measure of scattering of data points. It may be treated as a sort of variance. Dimensions are sorted starting with the one that is most responsible for the scattering of data points. The first dimension is always the one with the highest inertia, the second has the next highest, etc. The inertia of each dimension may be calculated by squaring the corresponding singular value. The sum of the inertia of all dimensions gives the total inertia of a table. Interestingly, total inertia may also be calculated using another method: by dividing the chi-square value by the number of cases in our data set (N). There were 500 respondents in our data set, therefore:
total inertia = chi-square / N = chi-square / 500
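The two routes to total inertia can be checked against each other in a few lines. The sketch below is only an illustration: it assumes NumPy is available, uses an invented 5 x 4 contingency table (not the chocolate survey data), and obtains the singular values from the matrix of standardized residuals, which is one common way of setting up the decomposition.

```python
import numpy as np

# Hypothetical 5x4 contingency table (invented counts, NOT the survey data).
table = np.array([[30, 10,  5,  5],
                  [10, 25, 10,  5],
                  [ 5, 10, 30, 10],
                  [ 5,  5, 10, 20],
                  [15,  5,  5, 10]], dtype=float)

n = table.sum()                      # number of cases (N)
P = table / n                        # correspondence matrix
r = P.sum(axis=1)                    # row masses
c = P.sum(axis=0)                    # column masses
expected = np.outer(r, c)            # proportions expected under independence

# Standardized residuals; their singular values are those of the analysis.
S = (P - expected) / np.sqrt(expected)
sv = np.linalg.svd(S, compute_uv=False)

# Route 1: inertia of each dimension is the squared singular value.
inertia_per_dim = sv ** 2

# Route 2: total inertia is the chi-square statistic divided by N.
chi2 = n * ((P - expected) ** 2 / expected).sum()
total_inertia = chi2 / n

print("Inertia per dimension:", np.round(inertia_per_dim, 4))
print("Sum of inertias:      ", round(float(inertia_per_dim.sum()), 6))
print("Chi-square / N:       ", round(float(total_inertia), 6))
```

Both printed totals should agree up to rounding, whatever table is supplied.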
However, the inertia measure alone does not say much about a specific dimension. If we compare the inertia value of a given dimension to total inertia, we obtain the proportion of inertia of that dimension in the total inertia of the table. This information may be found in the ‘Inertia proportion – explained’ column. As we can see, in our example, the first dimension is responsible for more than 48% of inertia, and the second – for 28%. Other dimensions are less significant. The third dimension is responsible for approx. 11% of inertia, the fourth – approx. 7%, the fifth – approx. 3.5%, and the sixth – approx. 1.5%.
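The arithmetic behind the ‘explained’ and ‘accumulated’ columns is simple enough to reproduce by hand. Here is a minimal sketch in plain Python; the singular values are made up for illustration and are not the ones from our Summary table.

```python
# Hypothetical singular values, one per dimension, sorted from largest
# to smallest (illustrative numbers only).
singular_values = [0.40, 0.30, 0.20, 0.10]

# Inertia of a dimension is its squared singular value.
inertias = [s ** 2 for s in singular_values]
total = sum(inertias)

# Proportion of total inertia explained by each dimension.
proportions = [i / total for i in inertias]

# Running total of the proportions (the 'accumulated' column).
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

for k, (p, cum) in enumerate(zip(proportions, cumulative), start=1):
    print(f"Dimension {k}: {p:.1%} explained, {cum:.1%} accumulated")
```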
There are six possible dimensions, as the maximum number of dimensions is calculated using the following formula:
maximum number of dimensions = min(I − 1, J − 1)
where:
I is the number of row variable categories,
J is the number of column variable categories.
All dimensions, put together, explain 100% of inertia.
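The rule for the maximum number of dimensions, the smaller of (I − 1) and (J − 1), fits in a one-line function. The name max_dimensions and the table sizes used below are hypothetical and chosen only for illustration.

```python
def max_dimensions(n_row_categories: int, n_col_categories: int) -> int:
    """Maximum number of dimensions for an I x J table: min(I - 1, J - 1)."""
    return min(n_row_categories - 1, n_col_categories - 1)


# Example sizes (hypothetical): any table whose smaller side has seven
# categories yields six dimensions.
print(max_dimensions(7, 9))
print(max_dimensions(5, 4))
```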
If, however, we limit ourselves to the first two dimensions, we will be able to explain approx. 77% of the total inertia (see column ‘Inertia proportion – accumulated’). Is the number big or small? That is debatable. If we took into account the third dimension, we would improve the result to 88%, but it would make interpretation of results far more complicated. Let us stick to two dimensions for now.
Let us now proceed to discussing results in the table ‘Overview of row data points’ (Table 2). Since the table is large, I decided to divide it into three separate parts for the purposes of this article, in order to be able to discuss in detail the content of each column. In the first column, we see information concerning the mass of specific categories. Mass is nothing more than the proportion of that category in the total number of respondents. As you may see in our example, 30% of all respondents watch television in the evening (this time of day was most often indicated in the survey); 15% of respondents do not watch television at all. In the two subsequent columns, we have the coordinates calculated for each data point, so it is always possible to replicate the perceptual map manually. Further on, we have the information concerning the inertia of specific data points, i.e. scattering within a given category. The inertia values of all data points sum up to the total inertia of the table. The ‘I do not watch TV’ category has the highest inertia.
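For readers who would like to see where mass, coordinates and point inertia come from, here is a rough sketch. It assumes NumPy, an invented 5 x 4 table with generic row labels, and row principal coordinates; statistical packages may scale the coordinates differently depending on the normalization method chosen, so the scores need not match a given output table, although masses and point inertias do not depend on that choice.

```python
import numpy as np

# Hypothetical 5x4 contingency table and generic labels (NOT the survey data).
table = np.array([[30, 10,  5,  5],
                  [10, 25, 10,  5],
                  [ 5, 10, 30, 10],
                  [ 5,  5, 10, 20],
                  [15,  5,  5, 10]], dtype=float)
labels = ["A", "B", "C", "D", "E"]

P = table / table.sum()
r = P.sum(axis=1)                    # row masses: share of each category
c = P.sum(axis=0)                    # column masses
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Row principal coordinates (assumed scaling): D_r^(-1/2) * U * diag(sv).
row_coords = (U * sv) / np.sqrt(r)[:, None]

# Inertia of a point: its mass times its squared distance from the origin,
# computed over all dimensions.
point_inertia = r * (row_coords ** 2).sum(axis=1)

for label, mass, xy, inertia in zip(labels, r, row_coords[:, :2], point_inertia):
    print(f"{label}: mass={mass:.3f}  dim1={xy[0]:+.3f}  dim2={xy[1]:+.3f}  "
          f"inertia={inertia:.4f}")
print("Sum of point inertias:", round(float(point_inertia.sum()), 6))
print("Total inertia:        ", round(float((sv ** 2).sum()), 6))
```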
In subsequent columns, we find important information concerning the proportion of specific data points in the inertia of the first and second dimension. We may check which categories determined the orientation of x and y axes. If we look at that part of the table, we may easily detect outlier categories, if any, that may distort the result. If any data point has a high proportion in a dimension’s inertia, and at the same time its mass is relatively low, we should pay attention to that category. In our example, the highest proportion in the 1st dimension’s inertia belongs to the ‘I do not watch TV’ category, and the category including persons who usually watch television at night has the strongest impact on the 2nd dimension. The only thing that may be a cause for concern is a high importance of ‘non-TV watchers’ category, since the purpose of the analysis was to determine which ads should be broadcast at which times of day. The ‘non-watcher’ category is important, as it shows us which chocolate varieties should not be advertised on TV at all. This category should not, however, dominate the whole result. A solution to that problem may be to define this category as a passive one, a topic that will be discussed in a later article.
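The share of each data point in a dimension's inertia can be computed as mass times squared principal coordinate, divided by the dimension's inertia, which reduces to the squared entries of the left singular vectors. The sketch below assumes NumPy and an invented table with generic labels. The screening rule (share more than twice the mass) is an arbitrary assumption made for illustration, not an established threshold.

```python
import numpy as np

# Hypothetical 5x4 contingency table and generic labels (NOT the survey data).
table = np.array([[30, 10,  5,  5],
                  [10, 25, 10,  5],
                  [ 5, 10, 30, 10],
                  [ 5,  5, 10, 20],
                  [15,  5,  5, 10]], dtype=float)
labels = ["A", "B", "C", "D", "E"]

P = table / table.sum()
r = P.sum(axis=1)                    # row masses
c = P.sum(axis=0)                    # column masses
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Share of each row point in the inertia of dimensions 1 and 2.
# Each column sums to 1 because the columns of U have unit length.
contrib = U[:, :2] ** 2

# Arbitrary screening rule (an assumption): flag a point whose share in a
# dimension's inertia is more than twice its mass.
ratio = contrib / r[:, None]
flagged = [(labels[i], int(k) + 1) for i, k in zip(*np.where(ratio > 2.0))]

for label, mass, (c1, c2) in zip(labels, r, contrib):
    print(f"{label}: mass={mass:.3f}  share in dim1={c1:.3f}  share in dim2={c2:.3f}")
print("Flagged (category, dimension):", flagged)
```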
Let us now look at the last three columns of the discussed result table. They represent proportions of specific dimensions in the inertia of data points. Here we should also consider to what extent two dimensions are able to explain the inertia of specific data points. Let us look, for example, at the ‘At night’ category. The first dimension cannot explain the inertia of that data point; the second one, on the other hand, explains most of it. Both dimensions put together explain 88% of that data point’s inertia. The situation of the ‘In the evening’ category is completely different. In that case, two dimensions explain as little as 37% of inertia. Categories entitled ‘In the afternoon’ and ‘In the morning’ are even more poorly represented. Adding a third dimension would probably allow for a better representation of those data points. The ‘I do not watch TV’ category achieved the best score. The first two dimensions explain almost 99% of this category’s inertia. This is another clue indicating that we should limit the influence of that category on the result, so as to allow for a better representation of categories that are more interesting to us in the context of our research problem.
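The proportion of a dimension in a point's inertia is the squared coordinate on that dimension divided by the point's squared distance from the origin across all dimensions. A minimal sketch follows, again assuming NumPy, an invented 5 x 4 table (so three dimensions exist and two of them cannot explain everything) and principal coordinates.

```python
import numpy as np

# Hypothetical 5x4 contingency table and generic labels (NOT the survey data).
table = np.array([[30, 10,  5,  5],
                  [10, 25, 10,  5],
                  [ 5, 10, 30, 10],
                  [ 5,  5, 10, 20],
                  [15,  5,  5, 10]], dtype=float)
labels = ["A", "B", "C", "D", "E"]

P = table / table.sum()
r = P.sum(axis=1)                    # row masses
c = P.sum(axis=0)                    # column masses
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Row principal coordinates on all dimensions (assumed scaling).
row_coords = (U * sv) / np.sqrt(r)[:, None]

# Squared distance of each point from the origin, over all dimensions.
dist2 = (row_coords ** 2).sum(axis=1)

# Share of dimensions 1 and 2 in each point's inertia, and their total,
# i.e. how well the two-dimensional map represents the point.
share = row_coords[:, :2] ** 2 / dist2[:, None]
quality = share.sum(axis=1)

for label, (s1, s2), q in zip(labels, share, quality):
    print(f"{label}: dim1={s1:.3f}  dim2={s2:.3f}  total={q:.3f}")
```

A total close to 1 means the point is well represented on the two-dimensional map; a low total suggests that its position on the map should be read with caution.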
In the same way, we may view statistics for column data points. As we can see, when we perform correspondence analysis, we obtain a lot of useful indicators, apart from the perceptual map, that allow us to assess the quality of the result obtained. Firstly, we may find out what percentage of total inertia is explained by a given number of dimensions. Secondly, we may perform detailed model diagnostics at the level of specific categories of row or column variables. On the one hand, we may see how different categories influence the result (and at the same time detect categories that distort the image); on the other, we may check whether the categories we are particularly interested in are properly rendered by the model.