# A Lesson in Unit Testing

Quite some time back, I wrote a Python library for simple clustering tasks.
Back in the day, I only needed *K-Means* clustering. But I decided I would add
*hierarchical clustering* as well as it could come in handy later on.

Turns out I **messed up** the unit tests, *and* the hierarchical clustering
algorithm. But the error went unnoticed as the unit-tests were not testing one
obvious fact:

Are all the elements from the input present in the output? Andonlythose from the input?

The main cause of this error is because the way I obtained the *expected*
values was complete heresy! I ran the test run on random values, and simply
copy/pasted the results into the expected values. Of course now the test
passed, with the important fact that the tests were **testing against the wrong
values!**

In order to fix address this, I first need some proper test values. So now I do
what I should have done in the first place: **Manually step through the
algorithm.** This article contains my scribbles which I will use as a "mental
strut".

## Hierarchical Clustering

We will start of with a sample of non primitive data types. In this case points on a two-dimensional plane. As distance function we will use the euclidian distance. And, for the first step through, we will use single linkage between clusters. The same scenario will be played through with complete- and uclus-linkage later on. As overall method, we will use a bottom-up approach.

The three aproaches will result in the following dendograms:

### Initial Step

So we have the following points:

a (1, 1) b (9, 1) c (2, 2) d (3, 2) e (9, 2) f (3, 4)

Note

To keep the matrix table more readable, individual points are written in lower-case, while clusters are written in upper-case!

This results in the following initial distance matrix. We will mark the
diagonal with an `x` as we're not interested in comparing a point with
itself. Equally, we are only interested in the absolute distance. So we only
need the values for pairs on ons side of the diagonal.

a | b | c | d | e | f | |
---|---|---|---|---|---|---|

a | x | 8.00 | 1.41 | 2.24 | 8.06 | 3.61 |

b | x | 7.07 | 6.08 | 1.00 | 6.71 | |

c | x | 1.00 | 7.00 | 2.24 | ||

d | x | 6.00 | 2.00 | |||

e | x | 6.32 | ||||

f | x |

### Single Linkage

#### First Step

With this matrix, we see that the first candidates are `[b, e]` and `[c,
d]`. We'll pick `[b, e]` as firt cluster (`A`):

a (1, 1) A (9, 1), (9, 2) level=1.00 c (2, 2) d (3, 2) f (3, 4)

a | A | c | d | f | |
---|---|---|---|---|---|

a | x | 8.00 | 1.41 | 2.24 | 3.61 |

A | x | 7.00 | 6.00 | 6.32 | |

c | x | 1.00 | 2.24 | ||

d | x | 2.00 | |||

f | x |

#### Second Step

The next candidate is `[c, d]` as `B`:

a (1, 1) A (9, 1), (9, 2) level=1.00 B (2, 2), (3, 2) level=1.00 f (3, 4)

Giving us:

a | A | B | f | |
---|---|---|---|---|

a | x | 8.00 | 1.41 | 3.61 |

A | x | 6.00 | 6.32 | |

B | x | 2.00 | ||

f | x |

#### Third Step

Then `[f, B]` as `C`:

a (1, 1) A (9, 1), (9, 2) level=1.00 C (3, 4), ((2, 2), (3, 2) level=1.00) level=2.00

Resulting in:

a | A | C | |
---|---|---|---|

a | x | 8.00 | 1.41 |

A | x | 6.00 | |

C | x |

#### Fourth Step

Then `[a, C]` as `D`:

A (9, 1), (9, 2) level=1.00 D (1, 1), ((3, 4), ((2, 2), (3, 2) level=1.00) level=2.00) level=1.41

Resulting in:

A | D | |
---|---|---|

A | x | 6.00 |

D | x |

Which gives us the final cluster `E` with a level of `6.00`.

The end-result is the following dendogram:

E | +-----------+-----------+ | | | D | | | +-----+-----+ | | | | C | | | | | +----+----+ | | | | | A B | | | | | | +--+--+ +--+--+ | | | | | | | | e b c d f a

### Complete Linkage

Initial distance matrix for reference:

a | b | c | d | e | f | |
---|---|---|---|---|---|---|

a | x | 8.00 | 1.41 | 2.24 | 8.06 | 3.61 |

b | x | 7.07 | 6.08 | 1.00 | 6.71 | |

c | x | 1.00 | 7.00 | 2.24 | ||

d | x | 6.00 | 2.00 | |||

e | x | 6.32 | ||||

f | x |

#### First Step

First iteration is identical, but distance matrix has different values. The subsequent steps will be displayed without aditional explanation, the idea is the same as above, simply using a different linkage function.

a | A | c | d | f | |
---|---|---|---|---|---|

a | x | 8.06 | 1.41 | 2.24 | 3.61 |

A | x | 7.07 | 6.08 | 6.71 | |

c | x | 1.00 | 2.24 | ||

d | x | 2.00 | |||

f | x |

#### Second Step

a | A | B | f | |
---|---|---|---|---|

a | x | 8.06 | 2.24 | 3.61 |

A | x | 7.07 | 6.71 | |

B | x | 2.24 | ||

f | x |

#### Third Step

Note

We now have to make a choice. I have not yet decided on how to handle this
situation to have a detereministic behaviour. My current train of thought
is using python `sets` as data-structure which is unordered. So the
algorithm could return either one here.

For a demonstration, we'll pick `[Ba]` as to have a different result from
sinle linkage...

This will give us:

C | A | f | |
---|---|---|---|

C | x | 8.06 | 3.61 |

A | x | 6.71 | |

f | x |

#### Fourth Step

And finally

D | A | |
---|---|---|

D | x | 8.06 |

A | x |

The end-result is the following dendogram:

E | +-----------+-----------+ | | | D | | | +-----+-----+ | | | | C | | | | | +----+----+ | | | | | A B | | | | | | +--+--+ +--+--+ | | | | | | | | e b c d a f

### Uclus

Initial distance matrix for reference:

a | b | c | d | e | f | |
---|---|---|---|---|---|---|

a | x | 8.00 | 1.41 | 2.24 | 8.06 | 3.61 |

b | x | 7.07 | 6.08 | 1.00 | 6.71 | |

c | x | 1.00 | 7.00 | 2.24 | ||

d | x | 6.00 | 2.00 | |||

e | x | 6.32 | ||||

f | x |

As in complete linkage, the first iteration is identical, but distance matrix has different values. So we will not go into too much detail and will skip the visualisations.

a | A | c | d | f | |
---|---|---|---|---|---|

a | x | 8.03 | 1.41 | 2.24 | 3.61 |

A | x | 7.04 | 6.04 | 6.52 | |

c | x | 1.00 | 2.24 | ||

d | x | 2.00 | |||

f | x |

a | A | B | f | |
---|---|---|---|---|

a | x | 8.03 | 1.83 | 3.61 |

A | x | 6.54 | 6.52 | |

B | x | 2.12 | ||

f | x |

C | A | f | |
---|---|---|---|

C | x | 7.04 | 2.24 |

A | x | 6.52 | |

f | x |

And finally

D | A | |
---|---|---|

D | x | 6.86 |

A | x |

The end-result is the following dendogram:

E | +-----------+-----------+ | | | D | | | +-----+-----+ | | | | C | | | | | +----+----+ | | | | | A B | | | | | | +--+--+ +--+--+ | | | | | | | | e b c d a f