
K-Means Clustering

Working Principle

K-Means Clustering is a simple but powerful unsupervised learning algorithm that aims to partition a given data set into "k" groups, or clusters. The process is intuitively simple:

  1. Place k cluster centroids at random locations in the data space.
  2. Assign each sample in the data space to the cluster of its nearest centroid.
  3. For each cluster, compute the geometric center (the mean of its samples) and move the centroid there.
  4. Repeat steps 2 and 3 until the centroids no longer change location.
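
The four steps above can be sketched in plain Python. This is a minimal illustration, not MAGIST's implementation: the function name `k_means` and its parameters are hypothetical, and step 1 is simplified by picking k of the samples at random as starting centroids rather than arbitrary locations in the data space.

```python
import math
import random


def k_means(samples, k, max_iter=100, tol=1e-9, seed=0):
    """Cluster `samples` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: start from k distinct samples chosen at random (an assumption;
    # other initialisation schemes exist).
    centroids = rng.sample(samples, k)
    labels = [0] * len(samples)
    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(point, centroids[j]))
            for point in samples
        ]
        # Step 3: move each centroid to the mean of its assigned samples.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(samples, labels) if label == j]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 4: stop once no centroid moves by more than `tol`.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels


points = [(0, 0), (1, 1), (2, 2), (10, 10), (11, 11), (12, 12)]
centroids, labels = k_means(points, k=2)
print(sorted(centroids), labels)
```

Because the result depends on the starting centroids, K-Means can settle on a poor grouping for some data sets; running it with several seeds and keeping the best result is a common remedy.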

Automatically compute "k"

There is a way to estimate the number of clusters (k) you need. This technique uses an elbow curve: a line graph of the number of clusters "k" against the corresponding variation within the clusters. The variation drops as k grows, and the bend (the "elbow") where the drop levels off suggests a suitable k. This is not currently implemented in MAGIST, but there are plans to implement it soon.
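
Since MAGIST does not provide this yet, the following is only a rough sketch of how an elbow curve could be computed by hand. Everything here is an assumption made for illustration: the function names are hypothetical, the data is one-dimensional (for example grayscale intensities), the starting centroids are spread deterministically over the sorted values, and the elbow is picked with a simple threshold rule rather than by reading a plot.

```python
def wcss_for_k(values, k, max_iter=100):
    """Run 1-D k-means on `values`; return the within-cluster sum of squares."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: spread the k centroids evenly over the sorted values.
    centroids = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum((v - centroids[j]) ** 2 for j, c in enumerate(clusters) for v in c)


def elbow_curve(values, max_k):
    """Map each k in 1..max_k to its within-cluster variation."""
    return {k: wcss_for_k(values, k) for k in range(1, max_k + 1)}


def pick_elbow(curve, min_gain=0.01):
    """Return the smallest k after which one more cluster removes at most
    `min_gain` of the k=1 variation (a hypothetical thresholding rule)."""
    ks = sorted(curve)
    for k, next_k in zip(ks, ks[1:]):
        if curve[k] - curve[next_k] <= min_gain * curve[ks[0]]:
            return k
    return ks[-1]


intensities = [10, 12, 14, 100, 102, 104, 200, 202, 204]
curve = elbow_curve(intensities, max_k=5)
for k, variation in curve.items():
    print(k, variation)
print("suggested k:", pick_elbow(curve))
```

Plotting `curve` as a line graph gives the elbow curve described above; the threshold in `pick_elbow` is just one way to turn the visual bend into a number, and `min_gain` would need tuning for real data.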

[Figure: Selection_070] 1

[Figure: K-means_convergence] 1

Implementation Using MAGIST

Currently, MAGIST uses K-Means Clustering for masking images. It attempts to identify individual objects in an image, which can then be processed further to extract more information. For this, you must use UnsupervisedModels under the Vision directory. It can be imported like so:

from MAGIST.Vision.UnsupervisedModels.img_cluster import RoughCluster # (1)
  1. UnsupervisedModels contains a Python file named img_cluster.py that contains the RoughCluster class. This class provides all the methods needed to compute the clusters.

Note

This computation is done with the help of scikit-learn and pandas.

Now, create an instance of the RoughCluster class:

clusterer = RoughCluster("PATH/TO/CONFIG.json") # (1)
  1. More information about the config file is located here: Setup Config File

Finally, we can compute the clusters:

n_of_clusters = 3
img_location = "Data/test.jpg"
img_size = (200, 200)
masked_img_dir = "Data/Clusters"

masked_img_locations = clusterer.unsupervised_clusters(n_of_clusters, img_location, img_size, masked_img_dir)

Here is what masked_img_locations should look like:

["Data/Clusters/masked0.jpg", "Data/Clusters/masked1.jpg", "Data/Clusters/masked2.jpg"] # (1)
  1. The number of images listed depends on n_of_clusters.

These images are the final masked results. Through the config file, you can enable verbose mode to see the intermediate stage before the images are exported individually.
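
To give a feel for what a masked result is, here is a small sketch of the masking idea in plain Python. It is not MAGIST's code: the function `mask_per_cluster`, the nested-list image representation, and the choice to black out pixels outside the cluster are all assumptions made for illustration. The per-pixel labels are written by hand here; in practice they would come from the clustering step.

```python
def mask_per_cluster(pixels, labels, n_clusters, fill=(0, 0, 0)):
    """Return one image per cluster; pixels outside the cluster become `fill`."""
    masked = []
    for cluster in range(n_clusters):
        image = [
            [px if lab == cluster else fill for px, lab in zip(pixel_row, label_row)]
            for pixel_row, label_row in zip(pixels, labels)
        ]
        masked.append(image)
    return masked


# A 2x3 RGB "image" and a hypothetical per-pixel cluster assignment.
pixels = [
    [(250, 10, 10), (245, 12, 8), (10, 10, 240)],
    [(12, 240, 12), (8, 250, 10), (12, 8, 250)],
]
labels = [
    [0, 0, 2],
    [1, 1, 2],
]
masks = mask_per_cluster(pixels, labels, n_clusters=3)
for index, image in enumerate(masks):
    print(index, image)
```

Each entry of `masks` keeps only the pixels of one cluster, which mirrors the one-file-per-cluster list shown above.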


  1. These images were acquired from Wikipedia.