Python, nested loops, matching and performance

I am trying to match a list of lastnames to a list of full names using Python 2.7 and the Levenshtein function. To reduce workload I only match if the first letters are identical (although this doesn't seem to make much of a difference performance-wise). If a match is found the matching word is removed from the full names (to make a subsequent first name matching easier). Both lists contain several ten thousand entries, so my solution is rather slow. How could I speed things up without parsing the fullnames? Here is what I have so far (I have omitted a few if-conditions for cases where the lastnames consist of several words):

import Levenshtein

listoflastnames=(['Jones', 'Sallah'])
listoffullnames=(['Henry', 'Jones', 'Junior'],['Indiana', 'Jones'])


def match_strings(lastname, listofnames):
    match=0
    matchedidx=[]
        for index, nameelement in enumerate(listofnames):        
            if lastname[0]==nameelement [0]:
                if Levenshtein.distance(nameelement, lastname)<2:
                    matchedidx.append(index)
                    match=match+1
    if match==1:
        newnamelist = [i for j, i in enumerate(listofnames) if j not in matchedidx]
    return 1, newnamelist 
return 0, listofnames



for x in listoflastnames:
    for y in listoffullnames:
        match, newlistofnames=match_strings(x,y)
        if match==1:
            #go to first name match...

Any help would be appreciated!

Update: in the meantime I have used the multiprocessing module to let all of my 4 cores handle the issue instead of just one, but the matching still takes a lot of time.

Answers


This simplifies the for loop in the match_string function, but didn't increase the speed noticeably in my tests. The biggest loss is in the two for loops with lastnames and fullnames.

def match_strings(lastname, listofnames):
    firstCaseMatched = [name for name in listofnames if lastname[0] == name[0]]
    if len(firstCaseMatched):
        matchedidx = [index for index, ame in enumerate(firstCaseMatched) if Levenshtein.distance(lastname, name) < 2]
        match = len(matchedidx)
    else:
        match = 0
    if match == 1:
        newnamelist = [i for j, i in enumerate(listofnames) if j not in matchedidx]
        return 1, newnamelist
    return 0, listofnames

You might have to sort the list of known last names, split them into a dict for each starting character. And then match each name in the list of names against that.

Assuming the fullnames list always has the first name as first element. You could limit the comparison to only the other elements.


Need Your Help

Automatic job to check if specific element in html page has been updated

html linux web

I want to 'surveil' a page on a website, and get an alert (preferably an email) each time a specific element on the html page is updated. What is the best way to achieve this? I would like the job ...

Sensor Fusion of Accelerometer and Gyroscope in Unity 3D

android unity3d accelerometer gyroscope sensor-fusion

I am trying to use the phone as a gun aiming with the gyroscope. I calibrate with the phone or tablet in a certain orientation. This will shoot straight. Then depending on the direction the phone is