What does a hierarchical cluster achieve

To carry out the hierarchical cluster analysis with the Ward method, we use the data from the World Happiness Report from 2020. The World Happiness Report is an annual report published by the Sustainable Development Solutions Network of the United Nations. The report includes life satisfaction rankings in different countries around the world and data analysis from different perspectives (see Helliwell et al., 2020).

Import of the data:

In this analysis we use two country-specific variables: life expectancy in years and the logarithm of gross domestic product per capita:

To make the steps of the hierarchical clustering algorithm easier to follow, we randomly draw 20 countries from the data set:
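A random draw of this kind could be sketched as follows. This is only an illustration under stated assumptions: it uses pandas, and the data frame, its column names (`country`, `life_expectancy`, `log_gdp`) and its values are made up rather than taken from the report.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full data set: 50 made-up "countries".
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "country": [f"Country {i}" for i in range(50)],
    "life_expectancy": rng.uniform(50, 80, size=50),
    "log_gdp": rng.uniform(7, 11, size=50),
})

# Draw 20 rows without replacement; the fixed seed makes the draw reproducible.
sample = df.sample(n=20, random_state=42).reset_index(drop=True)
print(sample.shape)
```

Fixing `random_state` is a design choice: it keeps the selected countries, and therefore every later figure, the same across runs.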

Representation of the countries in a scatter plot:

4.1 Data preparation

4.1.1 Variable selection

We create a new data set that contains only the variables used in the cluster analysis. We also keep the country name variable so that the data can be labeled meaningfully in a later step.

4.1.2 Missing values

We check whether the data contain any missing values:

There are no missing values in this data set. If there were, as may happen in another project, we could remove the affected rows with the command:
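As a minimal sketch of such a check and removal, assuming the data are held in a pandas data frame (the frame, its column names and its values below are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative, not the report's.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "life_expectancy": [72.1, np.nan, 65.3, 80.0],
    "log_gdp": [10.2, 9.1, np.nan, 11.0],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value.
df_clean = df.dropna().reset_index(drop=True)
print(df_clean)
```

Dropping rows is the simplest option; whether it is appropriate depends on how many cases are lost.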

4.1.3 Standardization

To bring the variables onto a common scale, we standardize the data with the z-transformation. After standardization, each variable has a mean of 0 and a standard deviation of 1. The formula is:

\[z = \frac{x - \bar{x}}{s}\]

  • \(x\): Observed value
  • \(\bar{x}\): Average value of the data
  • \(s\): Standard deviation of the data

We carry out the standardization and save the new, standardized variables in the data set.
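The z-transformation can be written out directly with NumPy. The sketch below uses made-up values and assumes the sample standard deviation (`ddof=1`) for \(s\); some libraries use the population version (`ddof=0`) instead, which changes the results slightly for small samples.

```python
import numpy as np

# Hypothetical values for two variables (rows = countries).
X = np.array([
    [72.1, 10.2],
    [61.5, 8.3],
    [65.3, 9.1],
    [80.0, 11.0],
])

# z = (x - mean) / s, computed column by column.
# ddof=1 gives the sample standard deviation (an assumption made here).
mean = X.mean(axis=0)
s = X.std(axis=0, ddof=1)
X_std = (X - mean) / s

print(X_std.mean(axis=0))         # column means after standardization
print(X_std.std(axis=0, ddof=1))  # column standard deviations after standardization
```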

As the figure shows, the relative positions of the countries do not change; only the units on the x and y axes do:

4.2 Measure of Proximity

We use the Euclidean distance as the measure of proximity. We compute the distance between every pair of countries and store the result in a new object. Since the country name variable must not enter the calculation, we exclude it when calling the distance function.
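One way to compute all pairwise Euclidean distances is SciPy's `pdist`; whether this matches the function used in the original analysis is an assumption. The three points below are made up so that the distances are easy to check by hand.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical standardized data: numeric columns only, no country names.
X_std = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
])

# Condensed distance vector, ordered as the pairs (0,1), (0,2), (1,2).
distances = pdist(X_std, metric="euclidean")
print(distances)

# The same information as a symmetric square matrix, which is easier to read.
distance_matrix = squareform(distances)
print(distance_matrix)
```

The condensed form stores each pair only once, so 20 countries yield 20 · 19 / 2 = 190 distances.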

4.3 Hierarchical cluster analysis

The next step is to run the hierarchical cluster analysis. To do this, we pass the object that contains the Euclidean distances between the countries to the clustering function (for further information on the function, see this post on Stack Overflow).
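A sketch of this step with SciPy's `linkage` and the Ward method, assuming SciPy as the library and using randomly generated stand-in data in place of the 20 countries:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Hypothetical standardized data for 20 "countries" and 2 variables.
rng = np.random.default_rng(0)
X_std = rng.normal(size=(20, 2))

# Pairwise Euclidean distances in condensed form.
distances = pdist(X_std, metric="euclidean")

# Agglomerative clustering with Ward's criterion.
# Ward is only well defined for Euclidean distances.
merges = linkage(distances, method="ward")

# Each row describes one merge: the two cluster ids, the merge distance,
# and the number of observations in the newly formed cluster.
print(merges.shape)
print(merges[:3])
```

With n observations there are n - 1 merges, so the result has 19 rows here, and the last row describes the single cluster that contains all observations.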

At the beginning of the agglomerative cluster formation, each country forms its own cluster. At the end, all countries are in one common cluster. The algorithm does not determine the optimal number of clusters; this has to be decided on the basis of further considerations. The so-called “cophenetic distance” and the “dendrogram” are helpful for this decision.

The agglomerative procedure first merges the countries that are closest to each other. The distance between two clusters at which a merge takes place can be determined with the cophenetic distance:

The smallest distance between two clusters is 0.44 at the beginning (when each country is its own cluster). This is the smallest distance between any two countries. Thereafter, the distance increases monotonically, as increasingly dissimilar clusters (i.e. clusters with a greater distance from one another) are merged. At the final merge, which combines all countries into a single cluster, the distance reaches its maximum value of 41. To make the values easier to interpret, the process is usually represented in a so-called dendrogram.
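The relationship between the cophenetic distances and the merge distances can be illustrated with SciPy's `cophenet`. The data are random stand-ins, so the numbers will not match the 0.44 and 41 reported above.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage

# Hypothetical standardized data for 20 "countries".
rng = np.random.default_rng(0)
X_std = rng.normal(size=(20, 2))

merges = linkage(X_std, method="ward")

# Cophenetic distance of a pair: the merge distance at which the two
# observations first end up in the same cluster (condensed vector).
coph = cophenet(merges)

# Smallest value = distance of the first merge,
# largest value = distance of the final merge into one cluster.
print(coph.min(), merges[0, 2])
print(coph.max(), merges[-1, 2])
```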

4.4 Dendrogram

The result of the clustering algorithm can be displayed with a dendrogram. The dendrogram is read from bottom to top and describes the clustering process in this direction. The vertical axis shows the heterogeneity of the clusters, measured by the aforementioned cophenetic distance (shown as the axis label in the figure). All cases are listed individually along the bottom of the dendrogram. At the start, each country is its own cluster, which can be seen from the fact that each case has its own vertical line. These clusters are gradually merged from bottom to top to form larger clusters. A horizontal line connecting two branches indicates that two clusters are merged, and its height gives the distance at which this happens.

Representation of the dendrogram:

Use of the country names as labels in the dendrogram:
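A labeled dendrogram could be produced with SciPy's `dendrogram`, as in the following sketch. The names and data are hypothetical. The example passes `no_plot=True` so that it runs without a display and only returns the tree layout; leaving that argument out draws the figure with matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical labels and standardized data (two obvious groups).
names = ["A", "B", "C", "D", "E", "F"]
X_std = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

merges = linkage(X_std, method="ward")

# labels= puts the names on the leaves instead of row numbers.
tree = dendrogram(merges, labels=names, no_plot=True)

# Leaf labels in the left-to-right order in which they would be drawn.
print(tree["ivl"])
```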

The “optimal” number of clusters should be chosen on substantive grounds, so that the resulting clusters are as plausible as possible. In addition, the greatest (or a large) increase in heterogeneity in the dendrogram can serve as a decision criterion. With our data, the greatest increase in heterogeneity occurs between the 2-cluster and the 1-cluster solution. The increase in heterogeneity between the 4-cluster and the 2-cluster solution is also relatively large. We choose 4 clusters here, although the 2-cluster solution would also have been defensible. As already mentioned, this method often yields no clear “optimal” solution, since the interpretability of the clusters on substantive grounds also plays an important role.
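The "largest increase in heterogeneity" criterion can also be checked numerically by comparing consecutive merge distances. The sketch below assumes SciPy and uses made-up data with two obvious groups; it is a heuristic aid, not a replacement for the substantive judgment described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical standardized data with two well-separated groups.
X_std = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

merges = linkage(X_std, method="ward")
heights = merges[:, 2]  # merge distances in increasing order

# Walk from the top of the tree downwards:
# gaps[0] is the jump from the 2-cluster to the 1-cluster solution,
# gaps[1] the jump from 3 to 2 clusters, and so on.
top_down = heights[::-1]
gaps = top_down[:-1] - top_down[1:]

for k, gap in enumerate(gaps, start=2):
    print(f"{k} -> {k - 1} clusters: increase {gap:.2f}")

# Number of clusters just before the largest jump.
best_k = int(np.argmax(gaps)) + 2
print(best_k)
```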

Representation of the dendrogram with red borders marking the 4-cluster solution:

Determination of the group membership (cluster 1 to cluster 4) of each country for k = 4 clusters. For this we use a function that “cuts” the tree at the requested number of clusters and assigns each country the number of its cluster.
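Cutting the tree at a fixed number of clusters can be sketched with SciPy's `fcluster` and `criterion="maxclust"`. The data below are hypothetical: four artificial groups of three points each, so the expected grouping is known in advance.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical standardized data: 4 well-separated groups of 3 points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [0.0, 8.0], [12.0, 0.0], [13.0, 9.0]])
X_std = np.repeat(centers, 3, axis=0) + rng.normal(scale=0.3, size=(12, 2))

merges = linkage(X_std, method="ward")

# "maxclust" cuts the tree so that at most t clusters remain.
# The returned cluster numbers start at 1.
clusters_4 = fcluster(merges, t=4, criterion="maxclust")
clusters_2 = fcluster(merges, t=2, criterion="maxclust")

print(clusters_4)
print(clusters_2)
```

The resulting array has one entry per observation, in the original row order, so it can be attached to the data set as an additional column.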

Adding the cluster number to the data set:

Representation of the clusters in a scatter plot:

For comparison, here is the assignment of the countries when 2 clusters are chosen:

Representation of the 2-cluster solution in a scatter plot: