The standard approach to investigating a single categorical variable involves a few elements: first, looking into how often individual responses were selected; second, how many instances of missing values there are and what type; third, looking into the percentage distribution of responses (both including and excluding missing values).
When looking at the distribution of responses, for example by running a crosstab, you would usually leave out missing values and focus on valid observations only. But is leaving them out an intentional decision, an oversight, or simply a habit? And when missing values are left out, what insights might we be losing as a consequence?
To see the difference between the percentage of all responses and the percentage of valid responses, have a look at Table 1. It shows the distribution of responses to the question: ‘Do you agree that one can become addicted to gambling?’.
According to the table, the most common response was ‘strongly agree’; it was selected by 1215 respondents, which is 60.6% of the 2006 participants. The ‘agree’ response was popular as well. Interestingly, the third most popular response was ‘undecided’; it was selected by 110 people (5.5% of all the respondents). At this point the analyst can follow one of two paths. If the ‘undecided’ response matters for the research, the percentage should be based on all respondents (the same percentage basis used so far). If, however, the analyst is interested only in those respondents who have an opinion about the question, it is justified to use only valid observations as the percentage basis (1897 in this example). This, of course, changes the percentage values for individual categories. For example, the ‘valid percentage’ statistic for ‘strongly agree’ is over 64%. Whichever approach is chosen at this stage should be applied consistently throughout all the analyses.
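The two percentage bases can be checked with a few lines of Python. This is a minimal sketch that uses only the figures quoted above; the function name `percentage` and the constant names are illustrative choices, and the valid base of 1897 is taken from the text as given rather than derived.

```python
def percentage(count, base):
    """Return count expressed as a percentage of base."""
    return 100 * count / base


TOTAL_RESPONDENTS = 2006  # all participants, as quoted in the text
VALID_RESPONDENTS = 1897  # valid observations, as quoted in the text
STRONGLY_AGREE = 1215     # respondents who chose 'strongly agree'
UNDECIDED = 110           # respondents who chose 'undecided'

# Percentage of all respondents (missing values included in the base).
pct_all = percentage(STRONGLY_AGREE, TOTAL_RESPONDENTS)
pct_undecided = percentage(UNDECIDED, TOTAL_RESPONDENTS)

# Valid percentage (only respondents with an opinion in the base).
pct_valid = percentage(STRONGLY_AGREE, VALID_RESPONDENTS)

print(f"strongly agree, all respondents:   {pct_all:.1f}%")
print(f"strongly agree, valid respondents: {pct_valid:.1f}%")
print(f"undecided, all respondents:        {pct_undecided:.1f}%")
```

The only thing that changes between the two statistics is the denominator, which is why the choice of base has to be made once and then kept for every table in the report.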
The next planned analytical stage is to check whether the opinion on the possibility of becoming addicted to gambling depends on whether or not the respondent gambles. Let’s assume we are interested only in responses of respondents who have an opinion. In this case, we will take the percentage of valid responses into account. Now, we can build a standard crosstab without missing values.
Among the respondents who have an opinion on the question at hand, there is a group of 437 people who gamble. The distribution of responses in the gambler group is similar to the one in the non-gambler group. What is interesting, however, is that the share of people convinced that gambling is addictive is slightly higher in the gambler group (65.9% compared to 63.6%). Also, gamblers selected the ‘strongly disagree’ response slightly less often than non-gamblers (1.9% compared to 2.1%). On the whole, although there are some differences between the groups, they appear to be minor. This is borne out by the Chi-square test of independence, which gives no grounds for generalising the relationship to the whole population (p > 0.05).
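For readers who want to see the mechanics of the test, here is a minimal sketch in plain Python. It is not the analysis reported above: the counts are illustrative, roughly reconstructed from the rounded percentages in the text, and the response scale is collapsed to ‘strongly agree’ versus any other valid response so that the table is 2x2. The function names are hypothetical, and the p-value helper is valid only for one degree of freedom.

```python
import math


def chi_square_statistic(table):
    """Return Pearson's chi-square statistic and degrees of freedom
    for a rectangular table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def p_value_one_df(stat):
    """Upper-tail p-value for a chi-square statistic with exactly 1 df."""
    return math.erfc(math.sqrt(stat / 2))


# ILLUSTRATIVE counts only, approximated from the rounded percentages
# in the text; they are not the actual cell counts of the crosstab.
illustrative_table = [
    [288, 149],  # gamblers: strongly agree, any other valid response
    [929, 531],  # non-gamblers: strongly agree, any other valid response
]

stat, df = chi_square_statistic(illustrative_table)
p_value = p_value_one_df(stat)
print(f"chi-square = {stat:.3f}, df = {df}, p = {p_value:.3f}")
```

With these illustrative counts the p-value stays above 0.05, in line with the conclusion in the text. For the full table with all response categories, a statistics package would normally be used instead of this hand-rolled version.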
How would the table look if undecided participants were taken into consideration? Maybe some non-gamblers refrain from gambling because they are sure it is addictive. Or, maybe they have no opinion and are not interested in this activity. Do people who gamble think about it? Do they have an opinion? Or, maybe they evade the question with a safe ‘undecided’?
Table 3 shows how response distribution changes when previously excluded observations with missing values are included.
As it turns out, the share of people unsure about the addictive properties of gambling is greater among non-gamblers (6.1% compared to 3.3%). At the same time, once missing values are taken into consideration, the share of ‘strongly agree’ answers drops to 59.7% among non-gamblers and to 63.7% among gamblers. All this may indicate that people who have experience of gambling are more likely to have an opinion on its adverse effects and more readily agree that it is harmful. It would be interesting to verify this hypothesis in a more in-depth research project.
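The figures above follow from a simple relationship: within a group, the percentage of all respondents equals the valid percentage multiplied by the share of respondents who gave a valid response. The sketch below checks this with the percentages quoted in the text. It assumes that ‘undecided’ is the only category treated as missing, and the function name `percent_of_all` is an illustrative choice.

```python
def percent_of_all(valid_percent, missing_percent):
    """Convert a valid percentage into a percentage of all respondents,
    given the percentage of respondents with a missing value."""
    return valid_percent * (100 - missing_percent) / 100


# Valid percentages for 'strongly agree' and undecided shares from the text.
gamblers = percent_of_all(valid_percent=65.9, missing_percent=3.3)
non_gamblers = percent_of_all(valid_percent=63.6, missing_percent=6.1)

print(f"gamblers:     {gamblers:.1f}%")
print(f"non-gamblers: {non_gamblers:.1f}%")
```

Because the undecided share is larger among non-gamblers, their ‘strongly agree’ percentage shrinks more when the base is widened, so the gap between the two groups grows.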
I hope this post and example will help you select the right statistics when reporting survey research. Usually, it is perfectly fine to use percentages based on valid observations because we are interested in respondents who have a specific opinion on the researched matters. In some cases, however, information on missing values gives a better understanding of the respondents and opens up more interesting avenues for analysis. In short, if you remove missing values without a second thought, you may be losing something interesting.