What is K-Means Clustering in Machine Learning?

We are going to explain the basic concept of k-means clustering and the k-means clustering algorithm.

Table of Contents

What is K-Means Clustering
How it Works
Steps in K-Means Clustering
Assignment of Clusters
Understanding the Formula
More Explanation of the K-Means Algorithm
Example of K-Means Clustering

1. What is k-Means Clustering

K-means is a clustering method that partitions your data into groups called clusters. Given n data points, the k-means algorithm assigns each data point to the cluster with the nearest mean.

Note: the k in k-means represents the number of clusters, that is k clusters.

2. How it works

Suppose we have a set of observations {x₁, x₂, …, x_N}, consisting of N observations of a random variable x (each x is a D-dimensional real vector). The goal is to partition the data set into some number K of clusters, where the value of K is known.
A cluster is a group of data points whose inter-point distances are small compared with the distances to points outside the cluster.
The first step is to find m_k, for k = 1, …, K, where m_k is the mean associated with the k-th cluster.
We then assign each data point to a cluster such that the sum of the squares of the distances from each data point to its closest mean m_k is a minimum.

3. Steps of the k-Means Algorithm

Step 1: For each data point x in the input space, place it in the cluster whose current centroid it is nearest to.
Step 2: After all points (x₁, …, x_N) have been assigned, recompute the location of the centroid of each of the K clusters.
Step 3: Reassign all the points to their nearest centroid.
Repeat steps 2 and 3 until convergence.
Convergence occurs when points no longer move between clusters and the centroids have stabilized.
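The three steps can be traced by hand on a tiny data set. The sketch below is only an illustration: the 1-D points and the two starting centroids are made up for this example and do not come from the article.

```python
# Illustrative 1-D data and starting centroids (made up for this sketch).
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = [1.0, 2.0]


def nearest(x, centroids):
    """Return the index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))


# Step 1: place every point in the cluster of its nearest centroid.
labels = [nearest(x, centroids) for x in points]
history = [list(centroids)]  # centroid positions after each pass

while True:
    # Step 2: move each centroid to the mean of the points assigned to it.
    for k in range(len(centroids)):
        members = [x for x, lab in zip(points, labels) if lab == k]
        if members:  # keep the old centroid if a cluster is empty
            centroids[k] = sum(members) / len(members)
    history.append(list(centroids))

    # Step 3: reassign all points to their nearest centroid.
    new_labels = [nearest(x, centroids) for x in points]

    # Convergence: no point moved to a different cluster.
    if new_labels == labels:
        break
    labels = new_labels

print(history)  # [[1.0, 2.0], [1.0, 7.6], [2.0, 11.0]]
print(labels)   # [0, 0, 0, 1, 1, 1]
```

With these numbers the centroids settle at 2.0 and 11.0 after two passes, and no point changes cluster on the final reassignment.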

This algorithm will become clearer when we consider an example later in this article.
Let’s now look in a little more detail at how the clusters are assigned.

4. Assignment of Clusters

(based on “Pattern Recognition and Machine Learning” by Christopher M. Bishop)
For each data point x_n, we introduce a corresponding set of binary variables r_nk ∈ {0, 1}, where k = 1, …, K.
r_nk describes which of the K clusters the data point x_n is assigned to. If x_n is assigned to cluster k, then r_nk = 1 and r_nj = 0 for all j not equal to k.
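As a small illustration of the indicator variables, the sketch below builds the full table of r_nk values from a list of cluster assignments. The helper name `one_hot_assignments` and the sample assignment are hypothetical, and the code uses 0-based cluster indices where the text uses k = 1, …, K.

```python
def one_hot_assignments(cluster_of, K):
    """Build r where r[n][k] = 1 if point n is assigned to cluster k, else 0."""
    return [[1 if k == assigned else 0 for k in range(K)]
            for assigned in cluster_of]


# Hypothetical assignment of N = 4 points to K = 3 clusters (0-based).
cluster_of = [0, 2, 2, 1]
r = one_hot_assignments(cluster_of, K=3)

for row in r:
    print(row)
# [1, 0, 0]
# [0, 0, 1]
# [0, 0, 1]
# [0, 1, 0]
```

Every row contains exactly one 1, because each data point belongs to exactly one cluster.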

Note that the two subscripts in r_nk (n and k) indicate that there are two loops:
Loop 1 (1 to N): iterate through all the data points
Loop 2 (1 to K): iterate through all the clusters

Now, for each iteration, you need to calculate a value for J, which is called the distortion measure and is given by the sum-of-squares function:

J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ‖x_n − m_k‖²
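The sum-of-squares function, J = Σ_n Σ_k r_nk ‖x_n − m_k‖² in its standard form, can be written out directly as two nested loops. The sketch below assumes Euclidean distance; the function name `distortion` and the sample points are made up for illustration.

```python
def distortion(points, means, r):
    """J = sum over n and k of r[n][k] * squared distance from x_n to m_k."""
    J = 0.0
    for n, x in enumerate(points):        # Loop 1: over the N data points
        for k, m in enumerate(means):     # Loop 2: over the K clusters
            sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, m))
            J += r[n][k] * sq_dist        # counts only the assigned cluster
    return J


# Illustrative values: three 2-D points, two means, one assignment each.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0)]
means = [(0.0, 1.0), (4.0, 0.0)]
r = [[1, 0], [1, 0], [0, 1]]

print(distortion(points, means, r))  # 2.0
```

The first two points each lie at squared distance 1 from their mean and the third sits exactly on its mean, so J = 1 + 1 + 0 = 2.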

5. Let’s try to understand this formula

First, note that this formula is sometimes called the distortion measure, since it represents how far each data point x_n is from the mean m_k of its cluster.
In other words, J represents the sum of the squares of the distances of each data point to its assigned vector m_k.

N is the total number of data points
K is the number of clusters
x_n is the vector of measurements for data point n
m_k is the mean for cluster k
r_nk is an indicator variable that indicates whether x_n is assigned to cluster k

We need to determine the values of {r_nk} and {m_k} that give the smallest value of J.
This is what the k-means algorithm does, although it is only guaranteed to reach a local minimum of J.
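For reference, minimizing J can be split into two alternating updates, written here in the symbols defined above. This is the standard form of the result, stated as a summary rather than a full derivation.

```latex
% Assignment step: with the means m_k held fixed, J is minimized by
% giving each point to its nearest mean.
r_{nk} =
\begin{cases}
1 & \text{if } k = \arg\min_{j} \lVert x_n - m_j \rVert^2 \\
0 & \text{otherwise}
\end{cases}

% Update step: with the assignments r_{nk} held fixed, setting the
% derivative of J with respect to m_k to zero gives the cluster mean.
m_k = \frac{\sum_{n=1}^{N} r_{nk}\, x_n}{\sum_{n=1}^{N} r_{nk}}
```

The denominator of the second expression is simply the number of points assigned to cluster k, which is why m_k is called the mean of the cluster.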

6. More Explanation of the k-Means Algorithm

This algorithm is an iterative procedure involving the following steps:
Input a set of points x₁, …, x_N
Choose a value for K
Place m₁, …, m_K (let’s call these centroids) at random locations within your data space
Repeat the following steps until convergence:

For each point x_i:
find the nearest centroid m_j (compute the distance between x_i and m_j for every centroid m_j, and pick the cluster with the minimum distance)
assign the point x_i to the cluster of the nearest centroid

For each cluster j = 1, …, K, recompute the centroid position:
take all the data points that fall into this cluster
new centroid m_j = mean of all points x_i assigned to cluster j in the previous step

Stop when none of the cluster assignments change
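The procedure above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation. It makes a few assumptions that are not in the article: the initial centroids are K randomly chosen data points (rather than arbitrary random locations), distance is squared Euclidean, an empty cluster keeps its old centroid, and the names `k_means`, `seed` and `max_iters` are chosen for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, K, seed=0, max_iters=100):
    """Return (centroids, labels) for a list of D-dimensional tuples."""
    rng = random.Random(seed)
    # Assumption: initialise the centroids with K distinct data points.
    centroids = [tuple(p) for p in rng.sample(points, K)]
    labels = None

    for _ in range(max_iters):
        # Assign each point to the cluster of its nearest centroid.
        new_labels = [
            min(range(K), key=lambda j: squared_distance(x, centroids[j]))
            for x in points
        ]
        # Stop when none of the cluster assignments change.
        if new_labels == labels:
            break
        labels = new_labels

        # Recompute each centroid as the mean of its assigned points.
        for j in range(K):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its centroid
                centroids[j] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, labels


# Illustrative data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, K=2)
print(centroids)
print(labels)
```

On this small data set the first three points end up sharing one label and the last three the other. Which group is called cluster 0 depends on the random initialisation.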

7. Example of k-Means Clustering

The example below is based on K = 2 and N = 14.
Take some time to examine the progression through the steps and make sure you understand how it works.

kindsonthegenius
