k-means

For a first article, we'll see an implementation in Matlab of the so-called k-means clustering algorithm. K-means algorithm is a very simple and intuitive unsupervised learning algorithm. Indeed, with supervised algorithms, the input samples under which the training is performed are labeled and the algorithm's goal is to fit the training samples as close as possible (without overfitting...). With unsupervised algorithm, input samples are not labeled and the system is in charge to label them by itself. Specifically, the k-means algorithm takes some input data and try to group them into k (that's the meaning of "k" !) clusters in which each sample is closer to its center (the mean of the cluster) than to the center of other clusters. It is not a fully intelligent algorithm in the sense that the user must guess the number of clusters, however there are various techniques to auto-detect the best number of clusters to use. Various improvement have been proposed, but we'll detail only a simple version.

The algorithm is as follow : Given the data and the number k of clusters to find, choose randomly k points in the data to be the temporary centroids. For each point, evaluate the distance to each centroid (it can be any relevant measure such as the euclidian norm for 2D or 3D points) and assign the points to the nearest centroid. We have then k clusters, and for each cluster we compute the centroid (that's the meaning of "mean" in the algorithm name). The next k centroid will then be the nearest data point of the computed centroid. The operation is then repeated until convergence (for example, we can compute a cost function from the sum of the distance of all point to their respective centroid) or until a given number of iterations has been completed.

Here a simple (and probably non-efficient) implementation in Matlab

 

 

And here is the result (black stars are the initial centers and green ones the computed ones)

Res_center
Clustering centers result. Black : original centers, green : computed centers

 

Res_clusters_normalises
Clustering result.

For further details, see the Wikipedia k-means

Leave a Reply

%d bloggers like this: