A Lesson in Unit Testing
Quite some time back, I wrote a Python library for simple clustering tasks. At the time I only needed K-Means clustering, but I decided to add hierarchical clustering as well, since it might come in handy later on.
Turns out I messed up both the unit tests and the hierarchical clustering algorithm. The error went unnoticed because the unit tests were not checking one obvious fact:
Are all the elements from the input present in the output? And only those from the input?
The main cause of this error was that the way I obtained the expected values was complete heresy: I ran the algorithm on random values and simply copy/pasted the results into the expected values. Of course the tests then passed, with the important caveat that they were testing against the wrong values!
To address this, I first need some proper test values. So now I do what I should have done in the first place: manually step through the algorithm. This article contains my scribbles, which I will use as a "mental strut".
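Incidentally, the missing check from above can be expressed as a simple property-style test that needs no hard-coded expected values at all. Here is a minimal sketch; `hierarchical_clustering` is a hypothetical stand-in for the library's actual entry point, with a trivial placeholder body so the test is runnable:

```python
# Hypothetical stand-in for the library's clustering entry point.
# Assumed to return a list of clusters, each a list of input points.
def hierarchical_clustering(points, k):
    # Trivial placeholder: put everything into one cluster.
    return [list(points)]

def test_output_is_permutation_of_input():
    points = [(1, 1), (9, 1), (2, 2), (3, 2), (9, 2), (3, 4)]
    clusters = hierarchical_clustering(points, k=2)
    flattened = [p for cluster in clusters for p in cluster]
    # Every input element must appear in the output exactly once,
    # and nothing else may appear.
    assert sorted(flattened) == sorted(points)

test_output_is_permutation_of_input()
```

A test like this would have caught the bug regardless of how the expected clusterings were obtained.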
We will start off with a sample of non-primitive data types, in this case points on a two-dimensional plane. As distance function we will use the Euclidean distance, and for the first step-through we will use single linkage between clusters. The same scenario will be played through with complete and UCLUS linkage later on. As overall method, we will use a bottom-up (agglomerative) approach.
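For reference during the walkthrough, the three linkage criteria can be written down in a few lines. This is only a sketch of the textbook definitions, not the library's code; in particular, UCLUS is assumed here to use the median of the pairwise distances:

```python
from math import dist          # Euclidean distance (Python 3.8+)
from statistics import median

def single_linkage(xs, ys):
    """Distance between the closest pair across the two clusters."""
    return min(dist(x, y) for x in xs for y in ys)

def complete_linkage(xs, ys):
    """Distance between the farthest pair across the two clusters."""
    return max(dist(x, y) for x in xs for y in ys)

def uclus_linkage(xs, ys):
    """Median of all pairwise distances across the two clusters (assumed)."""
    return median(dist(x, y) for x in xs for y in ys)
```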
The three approaches will result in the following dendrograms:
So we have the following points:
a (1, 1)
b (9, 1)
c (2, 2)
d (3, 2)
e (9, 2)
f (3, 4)
To keep the matrix table more readable, individual points are written in lower-case, while clusters are written in upper-case!
This results in the following initial distance matrix. We will mark the diagonal with an x, as we are not interested in comparing a point with itself. Equally, we are only interested in the absolute distance, so we only need the values for pairs on one side of the diagonal.
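Rather than eyeballing fifteen square roots, the matrix values can be double-checked with a few lines of Python (a throwaway helper for deriving test values, not part of the library):

```python
from math import dist  # Euclidean distance (Python 3.8+)

points = {
    "a": (1, 1), "b": (9, 1), "c": (2, 2),
    "d": (3, 2), "e": (9, 2), "f": (3, 4),
}
names = list(points)

# Print the lower triangle only; the diagonal and the upper half
# are skipped, matching the convention described above.
print("   " + "  ".join(f"{n:>4}" for n in names[:-1]))
for i, p in enumerate(names):
    row = [f"{dist(points[p], points[q]):4.2f}" for q in names[:i]]
    print(f"{p}  " + "  ".join(row))
```

Running this gives, for example, 8.00 for a–b, 1.41 for a–c, and 1.00 for both b–e and c–d.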
Initial distance matrix for reference:
As with complete linkage, the first iteration is identical, but the distance matrix has different values. So we will not go into too much detail and will skip the visualisations.
The end result is the following dendrogram:
             E
   +---------+---------+
   |                   |
   |                   D
   |              +----+----+
   |              |         |
   |              C         |
   |         +----+----+    |
   |         |         |    |
   A         B         |    |
+--+--+   +--+--+      |    |
|     |   |     |      |    |
e     b   c     d      a    f
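As a final cross-check for hand-derived merge orders and heights, the whole bottom-up run can be simulated with a short, naive merge loop. This is a throwaway sketch for single linkage, not the library code; ties between equal distances are broken arbitrarily by iteration order:

```python
from math import dist  # Euclidean distance (Python 3.8+)

points = {"a": (1, 1), "b": (9, 1), "c": (2, 2),
          "d": (3, 2), "e": (9, 2), "f": (3, 4)}

def single_linkage(xs, ys):
    return min(dist(points[p], points[q]) for p in xs for q in ys)

clusters = [frozenset([name]) for name in points]
merges = []
while len(clusters) > 1:
    # Find the pair of clusters with the smallest single-linkage distance.
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
    )
    d = single_linkage(clusters[i], clusters[j])
    merged = clusters[i] | clusters[j]
    merges.append((merged, round(d, 2)))
    print(f"merge {sorted(clusters[i])} + {sorted(clusters[j])} at {d:.2f}")
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
```

For single linkage this prints the merges {b, e} and {c, d} at distance 1.00, then a joining {c, d} at 1.41, f joining at 2.00, and the final merge at 6.00.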