What is the difference between K-means and EM? How to decide k?

Upasana | May 22, 2019 | 2 min read | 126 views


K-means and Gaussian mixture model: what is the difference between K-means and EM? How to decide k?

Answer :

K-means assigns each observation to exactly one cluster (a hard assignment), whereas EM (Expectation Maximization), as used to fit a Gaussian mixture model, computes the probability that an observation belongs to each cluster (a soft assignment). This is where the two processes differ.
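
A minimal sketch of this contrast using scikit-learn (the same library the elbow example below relies on); the toy data here is an illustrative choice, not from the article:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# toy data: three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K-means: hard assignment -- each point gets exactly one cluster label
kmeans = KMeans(n_clusters=3, random_state=42).fit(X)
print(kmeans.labels_[:5])

# GMM fitted via EM: soft assignment -- one probability per cluster,
# each row of predict_proba sums to 1
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
print(gmm.predict_proba(X[:5]).round(3))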

Excerpt from ISLR: For instance, suppose that most of the observations truly belong to a small number of (unknown) subgroups, and a small subset of the observations are quite different from each other and from all other observations. Then since K-means force every observation into a cluster, the clusters found may be heavily distorted due to the presence of outliers that do not belong to any cluster. Mixture models are an attractive approach for accommodating the presence of such outliers. These amount to a soft version of K-means clustering.

Another difference lies in the assumptions. K-means computes each cluster's centroid as the mean and assigns points by Euclidean distance, which implicitly assumes that clusters are spherical (circular in 2D) and of similar spread. In EM with a Gaussian mixture model, we instead assume the data points in each cluster are Gaussian distributed, and each Gaussian's covariance can capture elongated, elliptical cluster shapes.
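
A short sketch of that shape assumption in practice; stretching the blobs with a linear transform (an illustrative choice, not from the article) makes the true clusters elliptical, which a full-covariance GMM can accommodate:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

# stretch the blobs so the true clusters are elliptical, not spherical
X, y = make_blobs(n_samples=300, centers=3, random_state=42)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

km_labels = KMeans(n_clusters=3, random_state=42).fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, covariance_type="full",
                             random_state=42).fit_predict(X)

# agreement with the true labels; the full-covariance GMM can follow
# the elongated shapes, whereas k-means cuts along spherical boundaries
print("k-means ARI:", adjusted_rand_score(y, km_labels))
print("GMM ARI:    ", adjusted_rand_score(y, gmm_labels))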

How to decide k?

Here, k is the number of clusters, and we want to find its optimal value.

When it comes to deciding k in k-means, we can use the elbow method.

Code for visualising the elbow method:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# synthetic data: 8 blobs in 12 dimensions
X, y = make_blobs(centers=8, n_features=12, shuffle=True, random_state=42)

# sweep candidate values of k in the range (4, 12) and score each fit
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))

visualizer.fit(X)    # fits KMeans once per value of k
visualizer.poof()    # renders the chart (show() in newer yellowbrick versions)
[Elbow chart: distortion score vs. k, with the elbow at k = 8]

Here, by default, the scoring metric is set to distortion, which computes the sum of squared distances from each point to its assigned centroid. Two other metrics can also be used with the KElbowVisualizer: silhouette and calinski_harabasz. The silhouette score calculates the mean Silhouette Coefficient of all samples, while the calinski_harabasz score computes the ratio of dispersion between and within clusters.
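
For example, switching the visualizer to the silhouette score only requires changing the metric parameter (a short sketch reusing X and the imports from the code above):

model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12), metric="silhouette")
visualizer.fit(X)    # scores each k by mean silhouette instead of distortion
visualizer.poof()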

The KElbowVisualizer also displays the time taken to train the clustering model for each k as a dashed green line; it can be hidden by setting timings=False. From the chart above, we can choose 8 as the value of k.

Choosing k in a Gaussian mixture model can be started the same way as in k-means, but the fitting process itself is based on maximizing likelihood. We choose k, randomly initialize the Gaussian parameters for each cluster, calculate the probability of each point belonging to each cluster, and then fine-tune the parameters using those probabilities, repeating the two steps until the likelihood stops improving. This is how the model gradually reaches an optimum.
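
As a minimal sketch of comparing candidate values of k under this likelihood-based fitting (BIC is one common criterion, not mentioned above; bic() is part of scikit-learn's GaussianMixture API, and the data mirrors the elbow example):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# same kind of data as the elbow example above
X, _ = make_blobs(n_samples=500, centers=8, n_features=12,
                  shuffle=True, random_state=42)

# fit a GMM via EM for each candidate k and record BIC (lower is better)
bic_scores = {}
for k in range(4, 13):
    gmm = GaussianMixture(n_components=k, random_state=42).fit(X)
    bic_scores[k] = gmm.bic(X)

best_k = min(bic_scores, key=bic_scores.get)
print("BIC per k:", bic_scores)
print("best k by BIC:", best_k)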

