Minimum Edit Distance Reconstruction

I know there are similar answer to this on stack, as well as online, but I feel I'm missing something. Given the code below, we need to reconstruct the sequence of events that led to the resulting minimum edit distance. For the code below, we need to write a function that outputs:

Equal, L, L
Delete, E
Equal, A, A
Substitute, D, S
Insert, T

EDIT: CODE IS UPDATED WITH MY (PARTIALLY CORRECT) SOLUTION

Here is the code, with my partial solution. It works for example I was given ("lead" -> "last"), but doesn't work for the example below ("hint" -> "isnt"). I suspect this is because the first character is equal, which is throwing off my code. Any tips or pointers in the right direction would be great!

def printMatrix(M):
        for row in M:
                print row
        print

def med(s, t):  
        k = len(s) + 1
        l = len(t) + 1

        M = [[0 for i in range(k)] for j in range(l)]
        MTrace = [["" for i in range(k)] for j in range(l)]

        M[0][0] = 0


        for i in xrange(0, k):
                M[i][0] = i
                MTrace[i][0] = s[i-1]

        for j in xrange(0, l):
                M[0][j] = j
                MTrace[0][j] = t[j-1]

        MTrace[0][0] = "DONE"

        for i in xrange(1, k):
                for j in xrange(1, l):

                        sub = 1
                        sub_op = "sub"
                        if s[i-1] == t[j-1]:
                                # equality
                                sub = 0
                                sub_op = "eq"


                        # deletion
                        min_value = M[i-1][j] + 1
                        op = "del"
                        if min_value > M[i][j-1] + 1:
                                # insertion
                                min_value = M[i][j-1] + 1
                                op = "ins"
                        if min_value > M[i-1][j-1] + sub:
                                # substitution
                                min_value = M[i-1][j-1] + sub
                                op = sub_op


                        M[i][j] = min_value
                        MTrace[i][j] = op                        

        print "final Matrix"
        printMatrix(M)
        printMatrix(MTrace)

############ MY PARTIAL SOLUTION

        def array_append(array,x,y):
            ops_string = MTrace[x][y]
            if ops_string == 'ins':
                array.append(("Insert",MTrace[0][y]))
            elif ops_string == 'sub':
                array.append(("Substitute",MTrace[x][0],MTrace[0][y]))
            elif ops_string == 'eq':
                array.append(("Equal",MTrace[x][0],MTrace[0][y]))
            elif ops_string == 'del':
                array.append(("Delete",MTrace[x][0]))


        i = len(s)
        j = len(t)

        ops_array = []
        base = M[i][j]
        array_append(ops_array,i,j)


        while MTrace[i][j] != "DONE":
            base = M[i][j]
            local_min = min(M[i][j-1],M[i-1][j],M[i-1][j-1])
            if base == local_min:
                i = i - 1
                j = j - 1
                array_append(ops_array,i,j)
            elif M[i][j-1] < M[i-1][j]:
                j = j -1
                array_append(ops_array,i,j)
            elif M[i-1][j] < M[i][j-1]:
                i = i - 1
                array_append(ops_array,i,j)
            else:
                i = i - 1
                j = j - 1
                array_append(ops_array,i,j)

        print ops_array
#########

        return M[k-1][l-1]      

print med('lead', 'last')

Answers


It's my opinion that understanding the algorithm more deeply is important in this case. Rather than giving you some pseudocode, I'll walk you through the essential steps of the algorithm, and show you how the data you want is "encoded" in the final matrix that results. Of course, if you don't need to roll your own algorithm, then you should obviously just use someone else's, as MattH suggests!

The Big Picture

This looks to me like an implementation of the Wagner-Fischer algorithm. The basic idea is to calculate the distances between "nearby" prefixes, take the minimum, and then calculate the distance for the current pair of strings from that. So for example, say you have two strings 'i' and 'h'. Let's lay them out along the vertical and horizontal axes of a matrix, like so:

  _ h
_ 0 1
i 1 1

Here, '_' denotes an empty string, and each cell in the matrix corresponds to an edit sequence that takes an input ('' or 'i') to an output ('' or 'h').

The distance from the empty string to any string of length L is L, (requiring L insertions). The distance from any string of length L to the empty string is also L (requiring L deletions). That covers the values in the first row and column, which simply increment.

From there, you can calculate the value of any location by taking the minimum from among the upper, left, and upper-left values, and adding one, or, if the letter is the same at that point in the string, taking the upper-left value unchanged. For the value at (1, 1) in the table above, the minimum is 0 at (0, 0), so the value at (1, 1) is 1, and that's the minimum edit distance from 'i' to 'h' (one substitution). So in general, the minimum edit distance is always in the lower right corner of the matrix.

Now let's do another, comparing is to hi. Here again, each cell in the matrix corresponds to an edit sequence that takes an input ('', 'i', or 'is') to an output ('', 'h', or 'hi').

  _ h i
_ 0 1 2
i 1 1 #
s 2 # #

We begin by enlarging the matrix, using # as a placeholder for values we don't know yet, and extending the first row and column by incrementing. Having done so, we can begin calculating results for positions marked # above. Let's start at (2, 1) (in (row, column), i.e. row-major notation). Among the upper, upper-left, and left values, the minimum is 1. The corresponding letters in the table are different -- s and h -- so we add one to that minimum value to get 2, and carry on.

  _ h i
_ 0 1 2
i 1 1 #
s 2 2 #

Let's move on to the value at (1, 2). Now things go a little differently because the corresponding letters in the table are the same -- they're both i. This means we have the option of taking the value in the upper-left cell without adding one. The guiding intuition here is that we don't have to increase the count because the same letter is being added to both strings at this position. And since the lengths of both strings have increased by one, we move diagonally.

  _ h i
_ 0 1 2
i 1 1 1
s 2 2 #

With the last empty cell, things go back to normal. The corresponding letters are s and i, and so we again take the minimum value and add one, to get 2:

  _ h i
_ 0 1 2
i 1 1 1
s 2 2 2

Here's the table we get if we continue this process for two longer words that start with is and hi -- isnt (ignoring punctuation) and hint:

  _ h i n t
_ 0 1 2 3 4
i 1 1 1 2 3
s 2 2 2 2 3
n 3 3 3 2 3
t 4 4 4 3 2

This matrix is slightly more complex, but the final minimum edit distance here is still just 2, because the last two letters of these two strings are the same. Convenient!

Recreating the Sequence of Edits

So how can we extract the types of edits from this table? The key is to realize that movement on the table corresponds to particular types of edits. So for example, a rightward movement from (0, 0) to (0, 1) takes us from _ -> _, requiring no edits, to _ -> h, requiring one edit, an insertion. Likewise, a downward movement from (0, 0) to (1, 0) takes us from _ -> _, requiring no edits, to i -> _, requiring one edit, a deletion. And finally, a diagonal movement from (0, 0) to (1, 1) takes us from _ -> _, requiring no edits, to i -> h, requiring one edit, a substitution.

So now all we have to do is reverse our steps, tracing local minima from among the upper, left, and upper-left cells back to the origin, (0, 0), keeping in mind that if the current value is the same as the minimum, then we must go to the upper-left cell, since that's the only kind of movement that doesn't increment the edit distance.

Here is a detailed description of the steps you could take to do so. Starting from the lower-right corner of the completed matrix, repeat the following until you reach the upper-left corner:

  1. Look at the neighboring cell to the upper left. If it doesn't exist, go to step 3. If the cell does exist, note the value stored there.
  2. Is the value in the upper-left cell equal to the value in the current cell? If so, then do the following:
    • Record an empty operation (i.e. Equal). No edit was required in this case because the characters at this location are the same.
    • Update the current cell, moving up and left.
    • Return to step 1.
  3. There are many branches here:
    • If there is no cell to the left and no cell above, then you are in the upper-left corner, and the algorithm has completed.
    • If there is no cell to the left, go to step 4. (This will continue in a loop until you reach the upper-left corner.)
    • If there is no cell above, go to step 5. (This will continue in a loop until you reach the upper-left corner.)
    • Otherwise, do a three-way comparison between the cell to the left, the cell to the upper left, and the cell above. Pick the one with the smallest value. If there are multiple candidates, you can pick one at random; they are all valid at this stage. (They correspond to different edit paths with the same total edit distance.)
      • If you picked the cell above, go to step 4.
      • If you picked the cell to the left, go to step 5.
      • If you picked the cell to the upper left, go to step 6.
  4. You are moving up. Do the following:
    • Record a deletion of the input character at the current cell.
    • Update the current cell, moving up.
    • Return to step 1.
  5. You are moving left. Do the following:
    • Record an insertion of the output character at the current cell.
    • Update the current cell, moving left.
    • Return to step 1.
  6. You are moving diagonally. Do the following:
    • Record a substitution of the input character at the current cell in place of the output character at the current cell.
    • Update the current cell, moving up and left.
    • Return to step 1.
Putting it Together

In the example above, there are two possible paths:

(4, 4) -> (3, 3) -> (2, 2) -> (1, 2) -> (0, 1) -> (0, 0)

and

(4, 4) -> (3, 3) -> (2, 2) -> (1, 1) -> (0, 0)

Reversing them, we get

(0, 0) -> (0, 1) -> (1, 2) -> (2, 2) -> (3, 3) -> (4, 4)

and

(0, 0) -> (1, 1) -> (2, 2) -> (3, 3) -> (4, 4)

So for the first version, our first operation is a movement to the right, i.e. an insertion. The letter inserted is h, since we're moving from isnt to hint. (This corresponds to Insert, h in your verbose output.) Our next operation is a diagonal movement, i.e. either a substitution, or a no-op. In this case, it's a no-op because the edit distance is the same at both locations (i.e. the letter is the same). So Equal, i, i. Then a downward movement, corresponding to a deletion. The letter deleted is s, since again, we're moving from isnt to hint. (In general, the letter to insert comes from the output string, while the letter to delete comes from the input string.) So that's Delete, s. Then two diagonal movements with no change in value: Equal, n, n and Equal, t, t.

The result:

Insert, h
Equal, i, i
Delete, s
Equal, n, n
Equal, t, t

Performing these instructions on isnt:

isnt   (No change)
hisnt  (Insertion)
hisnt  (No change)
hint   (Deletion)
hint   (No change)
hint   (No change)

For a total edit distance of 2.

I'll leave the second minimum path as an exercise. Keep in mind that both paths are completely equivalent; they may be different, but they will result in the same minimum edit distance of 2, and so are entirely interchangeable. At any point as you work backwards through the matrix, if you see two different possible local minima, you may take either one, and the final result is guaranteed to be correct

Once you grok all this, it shouldn't be hard to code at all. The key, in cases like this, is to deeply understand the algorithm first. Once you've done that, coding it up is a cinch.

Accumulation vs. Reconstruction

As a final note, you might chose to accumulate the edits as you populate the matrix. In that case, each cell in your matrix could be a tuple: (2, ('ins', 'eq', 'del', 'eq', 'eq')). You would increment the length, and append the operation corresponding to a movement from the minimal previous state. That does away with the backtracking, and so decreases the complexity of the code; but it takes up extra memory. If you do this, the final edit sequence will appear along with the final edit distance in the lower right corner of the matrix.


I suggest you have a look at the python-Levenshtein module. Will probably get you a long way there:

>>> import Levenshtein
>>> Levenshtein.editops('LEAD','LAST')
[('replace', 1, 1), ('replace', 2, 2), ('replace', 3, 3)]

You can process the output from edit ops to create your verbose instructions.


I don't know python, but the following C# code works if that's any help.

public class EditDistanceCalculator
{
    public double SubstitutionCost { get; private set; }
    public double DeletionCost { get; private set; }
    public double InsertionCost { get; private set; }

    public EditDistanceCalculator() : this(1,1, 1)
    {
    }

    public EditDistanceCalculator(double substitutionCost, double insertionCost, double deletionCost)
    {
        InsertionCost = insertionCost;
        DeletionCost = deletionCost;
        SubstitutionCost = substitutionCost;
    }

    public Move[] CalcEditDistance(string s, string t)
    {
        if (s == null) throw new ArgumentNullException("s");
        if (t == null) throw new ArgumentNullException("t");

        var distances = new Cell[s.Length + 1, t.Length + 1];
        for (int i = 0; i <= s.Length; i++)
            distances[i, 0] = new Cell(i, Move.Delete);
        for (int j = 0; j <= t.Length; j++)
            distances[0, j] = new Cell(j, Move.Insert);

        for (int i = 1; i <= s.Length; i++)
            for (int j = 1; j <= t.Length; j++)
                distances[i, j] = CalcEditDistance(distances, s, t, i, j);

        return GetEdit(distances, s.Length, t.Length);
    }

    private Cell CalcEditDistance(Cell[,] distances, string s, string t, int i, int j)
    {
        var cell = s[i - 1] == t[j - 1]
                            ? new Cell(distances[i - 1, j - 1].Cost, Move.Match)
                            : new Cell(SubstitutionCost + distances[i - 1, j - 1].Cost, Move.Substitute);
        double deletionCost = DeletionCost + distances[i - 1, j].Cost;
        if (deletionCost < cell.Cost)
            cell = new Cell(deletionCost, Move.Delete);

        double insertionCost = InsertionCost + distances[i, j - 1].Cost;
        if (insertionCost < cell.Cost)
            cell = new Cell(insertionCost, Move.Insert);

        return cell;
    }

    private static Move[] GetEdit(Cell[,] distances, int i, int j)
    {
        var moves = new Stack<Move>();
        while (i > 0 && j > 0)
        {
            var move = distances[i, j].Move;
            moves.Push(move);
            switch (move)
            {
                case Move.Match:
                case Move.Substitute:
                    i--;
                    j--;
                    break;
                case Move.Insert:
                    j--;
                    break;
                case Move.Delete:
                    i--;
                    break;
                default:
                    throw new ArgumentOutOfRangeException();
            }
        }
        for (int k = 0; k < i; k++)
            moves.Push(Move.Delete);
        for (int k = 0; k < j; k++)
            moves.Push(Move.Insert);

        return moves.ToArray();
    }

    class Cell
    {
        public double Cost { get; private set; }
        public Move Move { get; private set; }

        public Cell(double cost, Move move)
        {
            Cost = cost;
            Move = move;
        }
    }
}

public enum Move
{
    Match,
    Substitute,
    Insert,
    Delete
}

Some tests:

    [TestMethod]
    public void TestEditDistance()
    {
        var expected = new[]
            {
                Move.Delete, 
                Move.Substitute, 
                Move.Match, 
                Move.Match, 
                Move.Match, 
                Move.Match, 
                Move.Match, 
                Move.Insert,
                Move.Substitute, 
                Move.Match, 
                Move.Substitute, 
                Move.Match, 
                Move.Match, 
                Move.Match, 
                Move.Match
            };
        Assert.IsTrue(expected.SequenceEqual(new EditDistanceCalculator().CalcEditDistance("thou-shalt-not", "you-should-not")));

        var calc = new EditDistanceCalculator(3, 1, 1);
        var edit = calc.CalcEditDistance("democrat", "republican");
        Console.WriteLine(string.Join(",", edit));
        Assert.AreEqual(3, edit.Count(m => m == Move.Match)); //eca
    }

Need Your Help

What is R's multidimensional equivalent of rbind and cbind?

r multidimensional-array matrix-multiplication

When working with matrices in R, one can put them side-by-side or stack them top of each other using cbind and rbind, respectively. What is the equivalent function for stacking matrices or arrays in

Hibernate, alter identifier/primary key

hibernate primary-key

I receive the following exception when I'm trying to alter my @ID in an @Entity.