Fri 12 September 2014

Filed under programming

Tags algorithms github python

While I first intended to create the matrices manually, it turned out to take too much time if I wanted to reproduce the exact same error as happened in the unit-tests.

So in order to create the matrices used in the earlier blog-posts, I used the script at the end of this article. I took care to double-check the results however. There are times, where there's a choice of more than one cluster to pick. And as the order is not important, this choice becomes dependent on the concrete implementation, and thus non-deterministic in the unit-tests. This caused a discrepancy in the matrices generated by the script below, and the current implementation of the algorithm in library. If that was the case, I validated the results manually, and they checked out.

Unfortunately though, the data in the existing unit-test do not provoke a difference between UCLUS and average linkage (there are no real outliers). Which is unfortunate. I will (for now) not create a new data-set to verify the difference between UCLUS and average. I trust the current implementation enough to say that it will do the right thing...

Finally, here's the script used to create the test-data:

from __future__ import print_function
from difflib import SequenceMatcher


def mean(numbers):
    """
    Returns the arithmetic mean of a numeric list.
    see: http://mail.python.org/pipermail/python-list/2004-December/294990.html
    """
    return float(sum(numbers)) / float(len(numbers))


def median(numbers):
    """
    Return the median of the list of numbers.
    see: http://mail.python.org/pipermail/python-list/2004-December/294990.html
    """

    # Sort the list and take the middle element.
    n = len(numbers)
    copy = sorted(numbers)
    if n & 1:  # There is an odd number of elements
        return copy[n // 2]
    else:
        return (copy[n // 2 - 1] + copy[n // 2]) / 2.0


def textsim(x, y):
    sm = SequenceMatcher(lambda x: x in ". -", x, y)
    return 1 - sm.ratio()


def simplesim(x, y):
    return abs(x-y)

# sim = textsim
sim = simplesim


def single(a, b):
    return min([sim(ex, ey) for ex in a for ey in b])


def complete(a, b):
    return max([sim(ex, ey) for ex in a for ey in b])


def uclus(a, b):
    return median([sim(ex, ey) for ex in a for ey in b])


def average(a, b):
    return mean([sim(ex, ey) for ex in a for ey in b])


linkage = average


data = [['Lorem'],
        ['ipsum'],
        ['dolor'],
        ['sit'],
        ['amet'],
        ['consectetuer'],
        ['adipiscing'],
        ['elit'],
        ['Ut'],
        ['elit'],
        ['Phasellus'],
        ['consequat'],
        ['ultricies'],
        ['mi'],
        ['Sed'],
        ['congue'],
        ['leo'],
        ['at'],
        ['neque'],
        ['Nullam']]


data = [[_] for _ in [791, 956, 676, 124, 564, 84, 24, 365, 594, 940, 398, 971,
                      131, 365, 542, 336, 518, 835, 134, 391]]


def matrix(data):
    for i, row in enumerate(data):
        for j, cell in enumerate(data[:i]):
            minsim = linkage(row, cell)
            yield (minsim, (i, j))


def step(data):
    while len(data) > 1:
        sorted_matrix = sorted(matrix(data), key=lambda x: x[0])
        closest_items = sorted_matrix[0]
        a, b = closest_items[1]
        new_element = data.pop(max(a, b)) + data.pop(min(a, b))
        data.append(new_element)
        return data


def print_matrix2(data, iteration):
    values = []
    for i, row in enumerate(data):
        simdata = [linkage(row, x) for x in data[:i]]
        values.extend(simdata)
    if values:
        smallest = min(values)

    print('.. csv-table:: Matrix #{}'.format(iteration))
    print('    :header-rows: 1')
    print('    :stub-columns: 1')
    print('    :delim: :\n')
    print('    :' + ': '.join([', '.join([str(x) for x in _]) for _ in data]))
    for i, row in enumerate(data):
        simdata = [linkage(row, x) for x in data[:i]]
        strsimdata = ['**{:.3f}**'.format(_)
                      if _ == smallest else '{:.3f}'.format(_)
                      for _ in simdata]
        foo = ','.join([str(_) for _ in row])
        bar = ': '.join(strsimdata)
        print('    ' + foo, bar, sep=': ')


i = 1
while len(data) > 1:
    print_matrix2(data, i)
    print('\n')
    step(data)
    i += 1
print('\n')
print_matrix2(data, i)
print('\n')
Comment

Wed 10 September 2014

Filed under programming

Tags algorithms github python

Here we go again...

After further analysis (the trial & error kind), I found out that the error of modified values only occurs in unsorted data. Which makes perfect sense as Python's list.remove picks the first matching item. So the ordering of data plays a role.

As a quick fix …

Read More

Tue 09 September 2014

Filed under programming

Tags algorithms github python

Steps through the failing data.

This is a thumbnail of a visualisation of the separate clustering steps. Open original SVG file for better readability.

The linked image shows the individual steps through the existing clustering algorithm. Each "horizontal line" (the Y-Axis) represents one step of the algorithm. The distance between each "line" represents the level …

Read More


Michel Albert © Michel Albert Powered by Pelican and Twitter Bootstrap. Icons by Font Awesome and Font Awesome More