```
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
X, y = make_blobs(centers=8, n_features=12, shuffle=True, random_state=42)
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))
visualizer.fit(X)
visualizer.poof()
```

# What is the difference between K-mean and EM? How to decide k

Carvia Tech | May 22, 2019 | 2 min read | 126 views

## K-mean and Gaussian mixture model: what is the difference between K-mean and EM? How to decide k?

**Answer** :

Process of K-Means is something like assigning each observation to a cluster and process of EM(Expectation Maximization) is finding likelihood of an observation belonging to a cluster(probability). This is where both of these processes differ.

Excerpt from *ISLR* :
For instance, suppose that most of the observations truly belong to a small number of (unknown) subgroups, and a small
subset of the observations are quite different from each other and from all other observations.
Then since K- means force every observation into a cluster, the clusters found may be heavily distorted due to the
presence of outliers that do not belong to any cluster. Mixture models are an attractive approach for accommodating the
presence of such outliers. These amount to a soft version of K-means clustering.

Other assumption of k-means is finding centroid of a cluster by using *mean* which means it assumes that clusters must be
circular in motion. In EM, we assume that data points could be gaussian distributed.

*How to decide k?*

*How to decide k?*

k here means the optimal number of clusters.

When it comes to deciding **k in k-means**, we can use elbow method.

Here by default, scoring parameter metric is set to distortion, which computes the sum of squared distances from each
point to its assigned centroid. However, two other metrics can also be used with the `KElbowVisualizer`

– silhouette and
`calinski_harabaz`

. The `silhouette score`

calculates the mean Silhouette Coefficient of all samples, while the
`calinski_harabaz`

score computes the ratio of dispersion between and within clusters.

The `KElbowVisualizer`

also displays the amount of time to train the clustering model per K as a dashed green line,
but is can be hidden by setting `timings=False`

. So we can choose `8`

as value of k by conclusion from chart above.

Choosing `k in Gaussian mixture models`

can also be started as we did in k-means but since the process is dependent on
getting maximum likelihood. So we choose k and randomly initialize gaussian parameters for each cluster.
Then we calculate probability of each point belonging to that cluster and fine tune the model. This is how, slowly we
reach the point of optimization for model.

###### Top articles in this category:

- Google Data Scientist interview questions with answers
- Introduction to regression, correlation, multi collinearity and 99th percentile
- Explain a probability distribution that is not normal and how to apply that
- Top 100 interview questions on Data Science & Machine Learning
- When using Gaussian mixture model, how do you know it is applicable
- Difference between Loss, Accuracy, Validation loss, Validation accuracy in Keras
- Python coding challenges for interviews

###### Find more on this topic:

##### Machine Learning

Data science, machine learning, python, R, big data, spark, the Jupyter notebook, and much more

Last updated 1 week ago

###### Recommended books for interview preparation:

##### Similar Posts

- Configure Logging in gunicorn based application in docker container
- Connect to Cassandra with Python 3.x and get Pandas Dataframe
- Connect to MySQL with Python 3.x and get Pandas Dataframe
- Connect to Postgresql with Python 3.x and get Pandas Dataframe
- Python - Get Google Analytics Data
- Installing PySpark with Jupyter notebook on Ubuntu 18.04 LTS
- Python send GMAIL with attachment
- Send rich text multimedia email in Python
- Blueprints in Flask API Development
- Singleton Design Pattern in Python