 Michel Albert

# Better Testdata for HClustering

Fri 12 September 2014

Filed under programming

While I first intended to create the matrices manually, it turned out to take too much time if I wanted to reproduce the exact same error as happened in the unit-tests.

So in order to create the matrices used in the earlier blog-posts, I used the script at the end of this article. I took care to double-check the results however. There are times, where there's a choice of more than one cluster to pick. And as the order is not important, this choice becomes dependent on the concrete implementation, and thus non-deterministic in the unit-tests. This caused a discrepancy in the matrices generated by the script below, and the current implementation of the algorithm in library. If that was the case, I validated the results manually, and they checked out.

Unfortunately though, the data in the existing unit-test do not provoke a difference between UCLUS and average linkage (there are no real outliers). Which is unfortunate. I will (for now) not create a new data-set to verify the difference between UCLUS and average. I trust the current implementation enough to say that it will do the right thing...

Finally, here's the script used to create the test-data:

```from __future__ import print_function
from difflib import SequenceMatcher

def mean(numbers):
"""
Returns the arithmetic mean of a numeric list.
see: http://mail.python.org/pipermail/python-list/2004-December/294990.html
"""
return float(sum(numbers)) / float(len(numbers))

def median(numbers):
"""
Return the median of the list of numbers.
see: http://mail.python.org/pipermail/python-list/2004-December/294990.html
"""

# Sort the list and take the middle element.
n = len(numbers)
copy = sorted(numbers)
if n & 1:  # There is an odd number of elements
return copy[n // 2]
else:
return (copy[n // 2 - 1] + copy[n // 2]) / 2.0

def textsim(x, y):
sm = SequenceMatcher(lambda x: x in ". -", x, y)
return 1 - sm.ratio()

def simplesim(x, y):
return abs(x-y)

# sim = textsim
sim = simplesim

def single(a, b):
return min([sim(ex, ey) for ex in a for ey in b])

def complete(a, b):
return max([sim(ex, ey) for ex in a for ey in b])

def uclus(a, b):
return median([sim(ex, ey) for ex in a for ey in b])

def average(a, b):
return mean([sim(ex, ey) for ex in a for ey in b])

data = [['Lorem'],
['ipsum'],
['dolor'],
['sit'],
['amet'],
['consectetuer'],
['elit'],
['Ut'],
['elit'],
['Phasellus'],
['consequat'],
['ultricies'],
['mi'],
['Sed'],
['congue'],
['leo'],
['at'],
['neque'],
['Nullam']]

data = [[_] for _ in [791, 956, 676, 124, 564, 84, 24, 365, 594, 940, 398, 971,
131, 365, 542, 336, 518, 835, 134, 391]]

def matrix(data):
for i, row in enumerate(data):
for j, cell in enumerate(data[:i]):
minsim = linkage(row, cell)
yield (minsim, (i, j))

def step(data):
while len(data) > 1:
sorted_matrix = sorted(matrix(data), key=lambda x: x)
closest_items = sorted_matrix
a, b = closest_items
new_element = data.pop(max(a, b)) + data.pop(min(a, b))
data.append(new_element)
return data

def print_matrix2(data, iteration):
values = []
for i, row in enumerate(data):
simdata = [linkage(row, x) for x in data[:i]]
values.extend(simdata)
if values:
smallest = min(values)

print('.. csv-table:: Matrix #{}'.format(iteration))
print('    :stub-columns: 1')
print('    :delim: :\n')
print('    :' + ': '.join([', '.join([str(x) for x in _]) for _ in data]))
for i, row in enumerate(data):
simdata = [linkage(row, x) for x in data[:i]]
strsimdata = ['**{:.3f}**'.format(_)
if _ == smallest else '{:.3f}'.format(_)
for _ in simdata]
foo = ','.join([str(_) for _ in row])
bar = ': '.join(strsimdata)
print('    ' + foo, bar, sep=': ')

i = 1
while len(data) > 1:
print_matrix2(data, i)
print('\n')
step(data)
i += 1
print('\n')
print_matrix2(data, i)
print('\n')
```