Clustering is the process of partitioning a set of data objects into subsets, called clusters, such that the objects within each cluster strongly resemble one another (they share similar properties or features). The partitioning is automatic, and each cluster has distinct characteristics that set it apart from the others. Unlike classification algorithms, where a model groups test data based on labels supplied by a training set, clustering algorithms perform this grouping without the help of labels.
In this unsupervised process, the algorithm segments the data based on factors such as the distance between data elements (which defines the cohesion within each cluster and the boundaries between clusters) and domain knowledge of the research area. The distance measure may be Euclidean-based, information-based or correlation-based.
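As a minimal sketch of the distance measures just mentioned, the snippet below computes pairwise Euclidean and correlation distances for three made-up data points using SciPy. The data values are purely illustrative; note how two points can be far apart in Euclidean terms yet perfectly correlated.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three hypothetical data points, each described by three features
X = np.array([
    [1.0, 2.0, 3.0],
    [1.5, 1.8, 3.2],
    [8.0, 8.5, 9.0],
])

# Pairwise distance matrices under two different measures
euclidean = squareform(pdist(X, metric="euclidean"))
correlation = squareform(pdist(X, metric="correlation"))  # 1 - Pearson correlation

print(euclidean.round(2))
print(correlation.round(2))
```

The choice of measure changes which points count as "close": here the first and third points are distant in Euclidean space but have a correlation distance near zero, because their features rise in the same pattern.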
A similar approach was used in the mid-19th century by a London physician named John Snow, who plotted the locations of cholera deaths on a map during the 1854 epidemic. By spotting the geographical areas where the outbreak was most severe, he helped target efforts to contain and eradicate the disease.
In recent times, the clustering technique can be applied across different business areas to provide solutions, unravel hidden patterns in a dataset (knowledge discovery) and optimize business processes. In data science applications, clustering can also enhance feature engineering: labelling a large dataset is time-consuming and expensive, so clustering can help identify groups of strongly correlated features, much as dimensionality-reduction techniques such as principal component analysis do.
There are three main approaches one can employ in any clustering-based task: Hierarchical, Partitional and Bayesian clustering methods.
In this approach, the algorithm places similar data items into groups arranged in a tree-like structure called a dendrogram. It does this by repeatedly taking the two objects (or clusters) that are closest together, typically measured by Euclidean distance, and merging them into a cluster. The process continues until all possible clusters within the dataset have been formed.
There are two methods of Hierarchical Clustering. They are:
Agglomerative Clustering
This is a bottom-up approach in which each item starts as its own unique cluster, and at every iteration the closest clusters are merged, based on shared features (characteristics), into larger clusters. This method is relatively easy to understand and implement, and it allows the dataset to be grouped into a suitable number of clusters; however, it is not robust to noise and/or outliers.
Divisive Clustering
This is a top-down approach in which, at the first iteration, the entire dataset is treated as one supercluster. In subsequent iterations, it is split into N clusters based on the common features (characteristics) shared by the items within each split. While this method is more complex than agglomerative clustering, it can be more efficient, since each split only considers the neighbouring items within the cluster being divided, whereas agglomerative clustering accesses all the data items in the dataset.
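The bottom-up merging described above can be sketched with SciPy's hierarchical clustering routines. The six 2-D points below are made up for illustration; `linkage` builds the dendrogram by repeatedly merging the closest clusters, and `fcluster` cuts it at a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D points forming two visually separate groups
X = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # group near (1, 1)
    [8.0, 8.0], [8.3, 7.9], [7.8, 8.2],   # group near (8, 8)
])

# Agglomerative (bottom-up) merging using Ward linkage on Euclidean distance;
# Z records every merge and is exactly the dendrogram structure
Z = linkage(X, method="ward")

# Cut the dendrogram so that two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the tree itself, which is useful for choosing where to cut.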
Figure 1: A Representation of Agglomerative and Divisive Hierarchical Clustering.
Applications for Hierarchical Clustering
Use Study 1
A university wants to group students into high/medium/low risk based on attributes such as Age, Sex, Course of Study, Number of Years Spent, Current CGPA, Ethnicity, Accommodation Type, Scholarship Type and Job Type.
Once the segments are identified, education administrators, parents and other stakeholders can monitor students’ performance. The system can serve as an early-warning device to flag students who are on the verge of dropping out of school, as shown in the research paper authored by Ya-Han Hu and his team.
Use Study 2
HELP International is an international humanitarian non-governmental organization committed to eradicating poverty and providing basic amenities and relief to countries in need, especially during disasters and natural calamities. Through its many operational projects and advocacy practices, HELP has raised awareness of its vision as well as funds to carry out its mission.
After raising around $10 million for a recent project, the CEO of the NGO and his team need to decide how to use this money effectively. The major issue is how to deliver assistance to the countries in the direst need.
The available dataset for this analysis holds selected socio-economic factors (inflation rate, import and export rates, GDP per capita) and health factors (child mortality rate, life expectancy) that can be used to evaluate each country’s state of development.
With the proper application of cluster analysis and sound knowledge of socio-economic parameters, one can develop an automated system that measures each country’s state of development and places it in a defined cluster (developed, developing or undeveloped). The clustering algorithm would also give a hint of each feature’s impact on the clusters.
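A rough sketch of this country-grouping idea is shown below, using scikit-learn's agglomerative clustering. All country names and indicator values are invented for illustration; a real analysis would use the actual HELP dataset. Standardizing the features first matters, because otherwise GDP per capita would dominate the Euclidean distances.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Illustrative (made-up) indicators per country; columns are:
# child mortality (per 1000), GDP per capita ($), inflation (%), life expectancy (yrs)
countries = ["A", "B", "C", "D", "E", "F"]
X = np.array([
    [90.0,   600, 12.0, 55.0],   # low-development profile
    [85.0,   750, 15.0, 57.0],
    [30.0,  6000,  6.0, 70.0],   # middle-development profile
    [25.0,  7500,  5.0, 72.0],
    [ 4.0, 45000,  2.0, 81.0],   # high-development profile
    [ 5.0, 50000,  1.5, 82.0],
])

# Standardize so no single feature dominates the distance measure
X_scaled = StandardScaler().fit_transform(X)

# Ask for three clusters, mirroring developed/developing/undeveloped
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X_scaled)

for name, label in zip(countries, labels):
    print(name, label)
```

Inspecting the per-cluster means of each indicator afterwards is one simple way to read off which features drive the separation between clusters.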