Clustering lists of lists of strings

So my data set currently looks like the following:

```['microsoft','bizspark'],
['microsoft'],
['microsoft', 'skype'],
['amazon', 's3'],
['amazon', 'zappos'],
['amazon'],
```

.... etc.

Now what I would love to do is cluster these with respect to one another, using the Levenshtein distance to calculate word scores.

I would then iterate through all of the lists and compare each one's distance to the others, ending up with groups like:

```microsoft -> ['microsoft','bizspark'], ['microsoft'], ['microsoft', 'skype'],
amazon -> ['amazon', 's3'], ['amazon', 'zappos'], ['amazon'], ....
```

The question is how to do this? Should I calculate each Levenshtein distance on a word-by-word basis, i.e. for `['amazon', 'zappos']` and `['microsoft', 'bizspark']` I would first form the pairs (amazon, microsoft), (amazon, bizspark), (zappos, microsoft), (zappos, bizspark) and calculate the distance of each pair?

Or should I really just create strings from these and then calculate the distance?
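If you go the word-by-word route, here is a minimal sketch. It uses a plain-Python edit distance (no particular Levenshtein library is assumed) and averages the pair distances; taking the min or the sum would be equally defensible choices, so treat the combining function as a free parameter:

```python
from itertools import product

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def list_distance(words_a, words_b):
    """Average edit distance over every cross pair of words."""
    pairs = list(product(words_a, words_b))
    return sum(levenshtein(x, y) for x, y in pairs) / len(pairs)

print(levenshtein('kitten', 'sitting'))                      # 3
print(list_distance(['amazon', 'zappos'],
                    ['microsoft', 'bizspark']))
```

Note that averaging over all cross pairs penalizes `['amazon']` vs. `['amazon', 'zappos']` even though they share a word, which is one argument for keeping the min pair distance instead.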

What I should then end up with is an N×N matrix of distances:

```                         | ['microsoft','bizspark'] | ['amazon', 'zappos'] | ...
['microsoft','bizspark']    |            1             |          ?
--------------------------------------------------------------------------------
['amazon', 'zappos']        |            ?             |          1
...
```

Then how do I apply clustering to this to determine a cut-off threshold?
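Since SciPy is already in your toolbox, one way to sketch the thresholding step is hierarchical clustering with a distance cut via `fcluster`. The matrix values below are made up for illustration; note that SciPy expects a distance matrix with zeros on the diagonal (the 1s on the diagonal above look more like a similarity):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix for four documents.
dist = np.array([[0., 1., 7., 8.],
                 [1., 0., 7., 7.],
                 [7., 7., 0., 2.],
                 [8., 7., 2., 0.]])

# Condense to the upper-triangular vector form linkage() expects,
# then build the dendrogram and cut it at a chosen threshold.
condensed = squareform(dist)
Z = linkage(condensed, method='average')
labels = fcluster(Z, t=4.0, criterion='distance')
print(labels)  # documents 0,1 fall in one cluster; 2,3 in another
```

The cut-off `t=4.0` is exactly the threshold you are asking about; inspecting the merge heights in `Z` (or plotting `scipy.cluster.hierarchy.dendrogram`) is the usual way to pick it.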

One such suggestion using single words is discussed here

But I'm not sure how to go about it with word lists!

Please note, regarding implementation, I am using Python libraries such as NumPy, SciPy, and Pandas as needed.

What you match against probably depends primarily on what your goals are. If you want to match either word, you probably should match against both words separately. If you want to match against phrases, then ' '.join()'ing them is probably a good idea.
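The two strategies can be contrasted in a few lines, using `difflib.SequenceMatcher` from the standard library as a stand-in similarity (any string metric would show the same effect):

```python
from difflib import SequenceMatcher

def sim(a, b):
    """Similarity ratio in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()

doc = ['microsoft', 'bizspark']
query = 'bizspark'

# Strategy 1: match against each word separately, keep the best score.
best_word = max(sim(query, w) for w in doc)

# Strategy 2: match against the whole ' '.join()'ed phrase.
phrase = sim(query, ' '.join(doc))

print(best_word, phrase)  # the per-word match scores a perfect 1.0 here
```

A word-level query like this scores perfectly word-by-word but is diluted when compared against the joined phrase, which is why the choice should follow your goal.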

BTW, I recently did some fuzzy matching using difflib.get_close_matches(). It's in the Python Standard Library. I don't have anything against whatever Levenshtein distance library you may use; I just wanted to point out this option as one that worked for me.
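For example, joining each word list into a phrase and fuzzy-matching a (deliberately misspelled) query against them looks like this with your sample data:

```python
import difflib

docs = [['microsoft', 'bizspark'], ['microsoft'], ['microsoft', 'skype'],
        ['amazon', 's3'], ['amazon', 'zappos'], ['amazon']]

# Join each word list into a phrase, then fuzzy-match against the phrases.
phrases = [' '.join(d) for d in docs]
matches = difflib.get_close_matches('microsof', phrases, n=3, cutoff=0.6)
print(matches)  # 'microsoft' ranks first
```

The `cutoff` parameter plays the same role as the distance threshold you are after, just expressed as a similarity ratio in [0, 1].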

Maybe "frequent itemset mining" is more what you looking for than clustering.

It will find frequent word combinations, such that each document may be part of multiple patterns.
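A naive sketch of the idea (brute-force counting of small itemsets, not a real Apriori implementation; the `min_support` value is arbitrary):

```python
from collections import Counter
from itertools import combinations

docs = [['microsoft', 'bizspark'], ['microsoft'], ['microsoft', 'skype'],
        ['amazon', 's3'], ['amazon', 'zappos'], ['amazon']]

# Count every single word and every word pair across documents.
counts = Counter()
for doc in docs:
    words = sorted(set(doc))
    for size in (1, 2):
        counts.update(combinations(words, size))

# Keep itemsets that appear in at least `min_support` documents.
min_support = 2
frequent = {items: n for items, n in counts.items() if n >= min_support}
print(frequent)  # {('microsoft',): 3, ('amazon',): 3}
```

With more data, pairs like `('amazon', 's3')` would surface once they recur, and a single document can indeed support several patterns at once.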