Frequency, group structure, or percentage share analysis is one of the first tasks of an analyst. The most basic form of statistical description is the table (useful in particular for analysing qualitative variables), which contains both counts and aggregated statistics (shares, sums, averages, etc.). Still, it is a rather unattractive form of result visualisation. This is why we often use plots to present results. This form of variable distribution presentation is much clearer and helps quickly access the analysed values.
But what if a variable has several dozens or several hundreds of categories? How can you assess the share of individual villages in the structure of a region or present the abundance of press vocabulary? Another challenge would be an analysis of phrases used by online store users in a search engine or the topics hotel guests write about in comments.
Are we forced to merge categories or use traditional plots, which are often barely legible (or even completely indecipherable)? In such cases, a very attractive technique of visual presentation of the weight of individual categories comes to the rescue: word cloud, aka tag cloud. Its traditional application is shown below.
Although the word cloud may be associated mainly with the latest forms of visualisation, its origin dates back to the beginning of the twentieth century. That’s right! The word cloud is almost a century old! It was used for the first time by André Breton, a writer, literary critique, and surrealism theoretician to substantiate the conclusions of his research by presenting the impact individual writers had on the birth of this movement. In their original sketch, André Breton and Robert Desnos used the font size to represent the weight of individual authors and colour coding provided an additional classification.
The word cloud available to PS IMAGO PRO users refers back to these classic solutions. We will discuss its possibilities now. Our analysis will be based on Eurostat data on the population size in selected European countries. The thirty-seven states are too many for a table, bar chart, or pie chart but we don’t want to do only a standard TOP 10. This is where the word cloud steps in. Have a look at the visualisation below.
The interpretation is very simple; the larger the font, the larger the frequency or share in the analysed structure. This type of visualisation works perfectly when you need to identify the dominant category. You can easily read the names of the countries with the largest populations. It also helps assess the whole scale of the diversity of variable category count. Another advantage of the word cloud is its relatively concise form: it takes up much less space in a report than the frequency table it reflects, and you don’t need to use several tables or the ‘other’ category. It is also easy to optically divide the countries into large, medium, and small ones. Despite a large number of analysed categories, the word cloud is a relatively clear visualisation. That’s why it works well when analysing a large number of categories, words, or tags. This makes it a particularly attractive tool for analysing textual data.
However, the word cloud has some interpretative traps, which I will illustrate using a simple example. The word cloud is used first and foremost as an attractive form of visualisation and so it is very difficult to read the size relation, in particular as the words are not to scale. An inexperienced recipient may interpret the surface area of the word instead of its height, so that longer words may appear more important. This may be resolved with abbreviated names of categories or codes. Other features that need to be taken into consideration are font and colour. It is a good idea to use only upper case or lower case, avoid fancy fonts, and use a single colour. The perception of the weight of a word may depend on the words that surround it and the distance to the centre of the cloud. The above-mentioned reservations apply to the incorrect interpretation which result from the user giving in to optical illusions, rather than flaws of the cloud itself, which remains a particularly attractive visualisation tool.
Let’s have a closer look at additional options of the word cloud algorithm in PS IMAGO PRO. We will use Eurostat data again[1. This dataset focuses on the gross domestic product in EU member states and candidate states. We will additionally use colours to distinguish EU-12, countries that joined the European Union (after 1995), and countries seeking to become members or otherwise associated with the EU.
PS IMAGO PRO allows the user to use two modes of visualisation: words as shown in Figure 2, and bubbles as shown above. In the latter case, the surface area of the circle is assessed. The use of bubbles resolves the problem of difficult interpretation of words of different lengths. The user may select between a regular frequency analysis, and using an additional aggregated variable and selection of a colour variable, which will affect the colours of individual categories. An interesting effect can be achieved by modifying the order of categories; they are ordered from the centre of the cloud like a snail shell. The available options are: ascending order (the smallest categories in the centre), descending order (the largest categories in the centre), random, alphabetical, and by the colour variable. Bubbles also facilitate the use of labels with the name, count, value, and share in sum or count of the category. This way, the visualisation is more precise.
As you can see, the word cloud is a method of an attractive visualisation of a frequency table or a table with aggregated statistics for individual categories. It works particularly well for variables with large numbers of categories. The user is not limited to the count and sum. They may use other statistics and values of any indices as well. The word cloud will without a doubt add colour to your report. It can also facilitate interesting conclusions not clearly visible in a table or a traditional plot.
 Source of data: Eurostat (https://ec.europa.eu/eurostat/data/database). The data includes 37 countries: European Union member states, candidate states or states that pursue membership, Norway, Iceland, and Switzerland. Bosnia and Herzegovina was excluded due to the lack of data.
This blog is devoted to data collection and analysis with articles that aim to inspire data analysts from across the business world, academia and public sector. Our articles endeavor to inform, educate and entertain with one goal in mind: to show how to transform data into clear, attractive and usable information. We invite you to read and share.