Understanding Clustering (k-Means) in Machine Learning (Unsupervised Learning)

In this lesson we are going discuss clustering under the following topic:

Introduction to Clustering

Formalized Definition of k-Means Clustering

1-of-K Coding Scheme

The Expectation and Maximization Algorithm

Introduction to Clustering
Clustering is a class of unsupervised learning concept or machine learning whose goal is to identify groups or clusters within datapoints in a multi-dimensional space.
Suppose we have data sets {x₁, x₂, …, x_n} consisting of N observations of a random D-dimensional Euclidean variable x. The goal is to partition the data into K clusters when the value of K is preknown.
We can define a cluster as a group of data points whose inter-point distances are minimal compared with the distance with points that are outside the cluster.

Formalization of k-Means Clustering
Let’s assume a set of D-dimensional vectors μ_k where k = 1, 2, …, K and μ_k represents the centroid of the cluster k. Take centroid to be the mean of data points in the cluster.
Our goal is to find the assignment of data points to clusters as well as a set of vectors { μ_k} such that the sum of squares of distances of each data point to its closest vector μ_k is at a minimum.

1-of-K Coding Scheme
The 1-of-k Coding Scheme is a coding scheme applied to a K-dimensional vector x in which one of the elements x_k equals 1 and all the remaining elements 0. This means that a data point cannot belong to two clusters at the same time.

In this case, for each data point x, we introduce a corresponding set of binary indicator variables r_nk ∈ {0,1}.
If x_n is assigned to μ_k then r_nk = 1 and r_nj = 0 for j ≠ k
We can then define an objective function J know as distortion measure

This represents the sum of the squares of the distances of each of the data points from its assigned vector μ_k
The goal is to choose values for {r_nk} and {μ_k} so as to minimize J. This brings us to the next section where we would examine the algorithm.

The Expectation and Maximization Algorithm(k-Means)
The algorithm is an iterative algorithm where for each iteration, two steps are taken that optimizes the variables r_nk and μ_k

The algorithm goes this way.
First, choose an initial value for μ_k
In the first phase: we minimize J with respect to r_nk while keeping μ_k fixed.
In the second phase, we minimize J with respect to μ_k while keeping r_nk fixed.
Repeat the two steps until convergence.