Clustering lists of lists of strings

So my data set currently looks like the following:

['microsoft','bizspark'],
['microsoft'],
['microsoft', 'skype'],
['amazon', 's3'],
['amazon', 'zappos'],
['amazon'],

.... etc.

Now what I would love to do is cluster these with respect to one another, using the Levenshtein distance to calculate word similarity scores.

I would then iterate through all of the lists, compare the distances, and end up with groupings like the following:

microsoft -> ['microsoft','bizspark'], ['microsoft'], ['microsoft', 'skype'],
amazon -> ['amazon', 's3'], ['amazon', 'zappos'], ['amazon'], ....

The question is how to do this. Should I calculate each Levenshtein distance on a word-by-word basis? I.e., for ['amazon', 'zappos'] and ['microsoft', 'bizspark'], I would first form the pairs (amazon, microsoft), (amazon, bizspark), (zappos, microsoft), (zappos, bizspark) and calculate the distance of each pair.

Or should I really just create strings from these and then calculate the distance?
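For concreteness, here is a sketch of both options using a small pure-Python Levenshtein function (any edit-distance library should behave the same way):

```python
from itertools import product

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

list_a = ['amazon', 'zappos']
list_b = ['microsoft', 'bizspark']

# Option 1: word-by-word pairs, then aggregate (here: take the minimum)
pair_dists = {(x, y): levenshtein(x, y) for x, y in product(list_a, list_b)}
word_level = min(pair_dists.values())

# Option 2: join each list into one phrase and compare once
phrase_level = levenshtein(' '.join(list_a), ' '.join(list_b))
```

The aggregation in option 1 (minimum vs. mean vs. sum) is itself a choice, and each gives the matrix a different meaning.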

What I should then end up with is an N×N matrix of distances, with zeros on the diagonal:

                                ['microsoft','bizspark'] | ['amazon', 'zappos'] ....
    ['microsoft','bizspark']               0             |          ?
    --------------------------------------------------------------------------
    ['amazon', 'zappos']                   ?             |          0
            ...
            ....
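Here is roughly what I have in mind for building the matrix, assuming a levenshtein(a, b) edit-distance function and using the minimum pairwise word distance as the list-to-list metric (one reasonable choice among several):

```python
import numpy as np

def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def list_dist(l1, l2):
    # Minimum word-pair edit distance between the two lists
    return min(levenshtein(x, y) for x in l1 for y in l2)

data = [['microsoft', 'bizspark'], ['microsoft'], ['microsoft', 'skype'],
        ['amazon', 's3'], ['amazon', 'zappos'], ['amazon']]

n = len(data)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = list_dist(data[i], data[j])
```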

Then how do I apply clustering to this to determine a cut-off threshold?
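For reference, this is as far as I got with SciPy's hierarchy module; the example distance matrix below is hand-filled for illustration, and I am not sure whether my threshold choice is principled:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Symmetric NxN distance matrix with zero diagonal (example values only)
dist = np.array([[0., 0., 0., 6., 6., 6.],
                 [0., 0., 0., 7., 6., 7.],
                 [0., 0., 0., 6., 6., 6.],
                 [6., 7., 6., 0., 0., 0.],
                 [6., 6., 6., 0., 0., 0.],
                 [6., 7., 6., 0., 0., 0.]])

condensed = squareform(dist)              # condensed form expected by linkage
Z = linkage(condensed, method='average')  # average-linkage clustering
labels = fcluster(Z, t=3, criterion='distance')  # cut dendrogram at distance 3
```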

One such suggestion using single words is discussed here

But I'm not sure how to go about it with word lists.

Please note, regarding implementation, I am using Python libraries such as NumPy, SciPy, and Pandas as needed.

Answers


What you match against probably depends primarily on what your goals are. If you want to match either word, you probably should match against both words separately. If you want to match against phrases, then ' '.join()'ing them is probably a good idea.

BTW, I recently did some fuzzy matching using difflib.get_close_matches(). It's in the Python standard library. I don't have anything against whatever Levenshtein distance library you may use; I just wanted to point out this option as one that worked for me.
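For reference, get_close_matches takes a query, a list of candidates, and an optional similarity cutoff between 0 and 1:

```python
import difflib

companies = ['microsoft', 'amazon', 'skype', 'zappos', 'bizspark']

# Returns the candidates whose similarity ratio to the query exceeds
# the cutoff, best matches first
matches = difflib.get_close_matches('mikrosoft', companies, n=3, cutoff=0.6)
print(matches)  # → ['microsoft']
```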


Maybe "frequent itemset mining" is more what you're looking for than clustering.

It will find frequent word combinations, such that each document may be part of multiple patterns.
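A minimal illustration of the idea using only the standard library (a real frequent-itemset miner such as Apriori or FP-growth scales far better):

```python
from collections import Counter
from itertools import combinations

data = [['microsoft', 'bizspark'], ['microsoft'], ['microsoft', 'skype'],
        ['amazon', 's3'], ['amazon', 'zappos'], ['amazon']]

min_support = 3  # minimum number of documents containing the itemset

# Count single words and word pairs across documents
counts = Counter()
for doc in data:
    words = sorted(set(doc))
    counts.update((w,) for w in words)
    counts.update(combinations(words, 2))

frequent = {items: n for items, n in counts.items() if n >= min_support}
print(frequent)  # → {('microsoft',): 3, ('amazon',): 3}
```

A document like ['microsoft', 'bizspark'] then supports both the ('microsoft',) pattern and any frequent pair it contains, so documents can belong to multiple patterns at once.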

