Better Testdata for HClustering

Fri 12 September 2014

Filed under programming

Tags algorithms github python

While I first intended to create the matrices manually, it turned out to take too much time if I wanted to reproduce the exact same error as happened in the unit-tests.

So in order to create the matrices used in the earlier blog-posts, I used the script at the end of this article. I took care to double-check the results however. There are times, where there's a choice of more than one cluster to pick. And as the order is not important, this choice becomes dependent on the concrete implementation, and thus non-deterministic in the unit-tests. This caused a discrepancy in the matrices generated by the script below, and the current implementation of the algorithm in library. If that was the case, I validated the results manually, and they checked out.

Unfortunately though, the data in the existing unit-test do not provoke a difference between UCLUS and average linkage (there are no real outliers). Which is unfortunate. I will (for now) not create a new data-set to verify the difference between UCLUS and average. I trust the current implementation enough to say that it will do the right thing...

Finally, here's the script used to create the test-data:

from __future__ import print_function
from difflib import SequenceMatcher

def mean(numbers):
    Returns the arithmetic mean of a numeric list.
    return float(sum(numbers)) / float(len(numbers))

def median(numbers):
    Return the median of the list of numbers.

    # Sort the list and take the middle element.
    n = len(numbers)
    copy = sorted(numbers)
    if n & 1:  # There is an odd number of elements
        return copy[n // 2]
        return (copy[n // 2 - 1] + copy[n // 2]) / 2.0

def textsim(x, y):
    sm = SequenceMatcher(lambda x: x in ". -", x, y)
    return 1 - sm.ratio()

def simplesim(x, y):
    return abs(x-y)

# sim = textsim
sim = simplesim

def single(a, b):
    return min([sim(ex, ey) for ex in a for ey in b])

def complete(a, b):
    return max([sim(ex, ey) for ex in a for ey in b])

def uclus(a, b):
    return median([sim(ex, ey) for ex in a for ey in b])

def average(a, b):
    return mean([sim(ex, ey) for ex in a for ey in b])

linkage = average

data = [['Lorem'],

data = [[_] for _ in [791, 956, 676, 124, 564, 84, 24, 365, 594, 940, 398, 971,
                      131, 365, 542, 336, 518, 835, 134, 391]]

def matrix(data):
    for i, row in enumerate(data):
        for j, cell in enumerate(data[:i]):
            minsim = linkage(row, cell)
            yield (minsim, (i, j))

def step(data):
    while len(data) > 1:
        sorted_matrix = sorted(matrix(data), key=lambda x: x[0])
        closest_items = sorted_matrix[0]
        a, b = closest_items[1]
        new_element = data.pop(max(a, b)) + data.pop(min(a, b))
        return data

def print_matrix2(data, iteration):
    values = []
    for i, row in enumerate(data):
        simdata = [linkage(row, x) for x in data[:i]]
    if values:
        smallest = min(values)

    print('.. csv-table:: Matrix #{}'.format(iteration))
    print('    :header-rows: 1')
    print('    :stub-columns: 1')
    print('    :delim: :\n')
    print('    :' + ': '.join([', '.join([str(x) for x in _]) for _ in data]))
    for i, row in enumerate(data):
        simdata = [linkage(row, x) for x in data[:i]]
        strsimdata = ['**{:.3f}**'.format(_)
                      if _ == smallest else '{:.3f}'.format(_)
                      for _ in simdata]
        foo = ','.join([str(_) for _ in row])
        bar = ': '.join(strsimdata)
        print('    ' + foo, bar, sep=': ')

i = 1
while len(data) > 1:
    print_matrix2(data, i)
    i += 1
print_matrix2(data, i)


Michel Albert © Michel Albert Powered by Pelican and Twitter Bootstrap. Icons by Font Awesome and Font Awesome More