K-means algorithm, part 1: what is it?

I’ve been using this simple means of clustering data points together for a while now, so it’s probably worth discussing. Let’s talk about k-means clustering.

Let’s say we have a 2D matrix of data, where each row corresponds to an object and each column is a characteristic or attribute of said objects. If the matrix is in $\mathbb{R}^{m \times n}$, we have m objects, each with n characteristics.

We want to cluster these objects into groups the best way we can. So given some positive integer k, we can run a simple clustering algorithm to try to put the objects into k groups, each consisting of the points closest to a centroid, or a central “average” point.

At its simplest, the inputs are the constant k for the number of groups and a distance measure; for the time being, let’s assume we’re using the n-dimensional Euclidean distance, given as $d = \sqrt{(x_{1}-y_{1})^{2} + ... + (x_{n}-y_{n})^{2}}$.
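For concreteness, that distance is a one-liner in most languages. Here’s a quick NumPy sketch, with made-up example points:

```python
import numpy as np

def euclidean(x, y):
    # square root of the summed squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

x = np.array([0.0, 0.0, 0.0])
y = np.array([1.0, 2.0, 2.0])
print(euclidean(x, y))  # sqrt(1 + 4 + 4) = 3.0
```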

So in the simplest terms, I describe k-means clustering in the following pseudocode:

1. Initialize centroids; say, pick k points randomly to start with as centroids.
2. Find which centroid each object is closest to, by our measure of distance.
3. Recalculate each centroid based on the potentially new groupings of points from step 2.
4. Check whether the centroids (equivalently, the group assignments) changed; if so, go back to step 2 and keep looping until convergence.
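Those four steps condense into just a few lines. Here’s an illustrative NumPy version (the MATLAB code I actually use appears at the end of this post); the function name `kmeans` and its defaults are my own:

```python
import numpy as np

def kmeans(data, k, max_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k distinct data points as the starting centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    groups = np.full(len(data), -1)
    for _ in range(max_iter):
        # step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_groups = dists.argmin(axis=1)
        # step 4: stop once the assignments no longer change
        if np.array_equal(new_groups, groups):
            break
        groups = new_groups
        # step 3: recompute each centroid as the mean of its current group
        for j in range(k):
            if np.any(groups == j):
                centroids[j] = data[groups == j].mean(axis=0)
    return groups, centroids
```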

Sounds simple, right? The algorithm simply says these are the starting points, let’s find the closest points to these places, readjust our centroids, and then find the closest points again.

There are some things that need to be addressed when using the k-means clustering algorithm though.

• How are centroids initialized? Randomly made groups? Pre-clustering?
• What value of k makes sense here? Evaluate quality of clustering results?
• Alternative distance measures? Convergence?
• Is k-means even suitable for clustering the data I have?

While the algorithm above simply initializes the centroids by picking points at random, there are definitely alternative ways to do it. This part is extremely important: even assuming convergence happens and we end up at an answer, choosing different starting points will very likely lead us to a different answer that’s not quite the same.
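One well-known alternative is k-means++ seeding: pick the first centroid at random, then pick each subsequent one from the data with probability proportional to its squared distance from the nearest centroid chosen so far, which tends to spread the starting points out. A NumPy sketch (the helper name `kmeanspp_init` is mine):

```python
import numpy as np

def kmeanspp_init(data, k, rng=None):
    # k-means++ seeding: spread the starting centroids apart
    if rng is None:
        rng = np.random.default_rng()
    centroids = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen centroid
        d2 = np.min([np.sum((data - c) ** 2, axis=1) for c in centroids], axis=0)
        # sample the next centroid proportionally to that squared distance
        probs = d2 / d2.sum()
        centroids.append(data[rng.choice(len(data), p=probs)])
    return np.array(centroids)
```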

The choice of k is another huge problem; if we do not know how many groups there should be, then we need a way to evaluate which value of k gives us the best results, “best” being somewhat subjective, depending on what you want to see.
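One common heuristic is the “elbow method”: run k-means for several values of k, compute the within-cluster sum of squared distances for each run, and look for the point where that sum stops dropping sharply. A sketch of the objective itself (the helper name `inertia` is mine):

```python
import numpy as np

def inertia(data, groups, centroids):
    # total within-cluster sum of squared distances; lower means tighter clusters
    return sum(np.sum((data[groups == j] - c) ** 2)
               for j, c in enumerate(centroids))
```

Plot this against k; past the elbow, adding more clusters buys you little.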

While the Euclidean distance measure is by far the easiest to deal with, there are other distance measures that can work, and they generally work better, or at least differently, on some data sets. Euclidean distance tends to produce round, globular clusters (picture an L2-norm unit ball), so there are cases where a different distance function is a better fit for the data. K-means in general may not be the right answer if you do not want globular clusters.
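For instance, swapping in the Manhattan (L1) distance is a one-liner; note, though, that once you change the distance, the mean is no longer the matching centroid update — for L1 it’s the per-coordinate median, which gives the k-medians variant:

```python
import numpy as np

def manhattan(x, y):
    # L1 distance: sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

print(manhattan(np.array([0.0, 0.0]), np.array([1.0, 2.0])))  # 1 + 2 = 3.0
```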

Finally, there’s the whole convergence problem. The Euclidean distance guarantees us convergence: thanks to the squaring of differences, each iteration can only decrease the total within-cluster squared distance, so the problem settles into a local critical point much like a convex optimization problem. Choosing other distance measures may not guarantee us convergence anymore.

I hope to cover all of these issues eventually in other posts, but in the meantime, here’s an example of what k-means gets you when run on uniformly distributed points with k = 4. The code I threw together in a couple of minutes for doing so is given below in a code block.

function kmeans_demo()
    k = 4;
    num_points = 500;
    % create 500 uniformly distributed 2D data points.
    data = rand(num_points, 2);
    % do k-means clustering with a given k
    [groups, centroids] = k_means(data, k);
    % plot the data, no line, points only, one color per group.
    color_char = 'rgbm';
    figure;
    hold on;
    for k_ind = 1:k
        plot( data(groups==k_ind, 1), data(groups==k_ind, 2), ['x', color_char(k_ind)] );
        plot( centroids(k_ind, 1), centroids(k_ind, 2), ['*', color_char(k_ind)], 'LineWidth', 3 );
    end
    title( 'k-means, Euclidean distance, k=4' );
    xlabel( 'x' );
    ylabel( 'y' );
end

function [groups, centroids] = k_means( data, k )
    % initialize centroids from a random grouping of the points
    num_points = size(data, 1);
    groups = ceil( k .* rand(num_points, 1) );
    centroids = zeros( k, size(data, 2) );
    for k_ind = 1:k
        if any( groups == k_ind )
            centroids(k_ind, :) = mean( data(groups==k_ind, :), 1 );
        else
            % unlikely, but guard against an empty initial group
            centroids(k_ind, :) = data( ceil(num_points * rand), : );
        end
    end
    distances = zeros( num_points, k );
    groups = zeros( num_points, 1 );
    % loop until the group assignments stop changing (or 50 iterations)
    converge_flag = 0;
    counter = 0;
    while converge_flag ~= 1 && counter < 50
        counter = counter + 1;
        old_groups = groups;
        % distance from every point to every centroid
        for k_ind = 1:k
            distances(:, k_ind) = sqrt( sum(bsxfun( @minus, data, centroids(k_ind, :) ).^2, 2) );
        end
        % assign each point to its nearest centroid
        [~, groups] = min( distances, [], 2 );
        if all( old_groups == groups )
            converge_flag = 1;
        end
        % recalculate centroids; leave a centroid alone if its group went empty
        for k_ind = 1:k
            if any( groups == k_ind )
                centroids(k_ind, :) = mean( data(groups==k_ind, :), 1 );
            end
        end
    end
    % return the groups we have, get out.
end