Minimun Edit Distance

1. Edit distance from string to string
2. Align gene sequence
- 2.1. Test to myself

1 Edit distance from string to string

The edit distance from string str1 to string str2 is used to measure the cost of modifying str1 to make it equals to str2. It is used in spelling correction and coreference of named entities such as "U.S.A" to "USA".

Define three edit actions: insertion, deletion and substitution; the edit distance is the weighted sum of number of operations. The costs of insertion and deletion are usually set to one while the cost of substitution is set to two since the substitution can be thought as a combination of deletion and insertion.

1.1 Min Edit Distance (MED)

Min Edit Distance (MED) is computed through a dynamic programming process, which is very similar to computing the maximum probability of Hidden Markov model. MED is sometimes called Levenshtein distance. See below the algorithm used to compute the MED.

def med(str1, str2):
    len1, len2 = len(str1)+1, len(str2)+1
    # init the distance table
    D=[[0]*len2 for i in range(len1)]
    D[0]=range(len2)
    for i in range(len1):
        D[i][0]=i

    for i in range(1, len1):
        for j in range(1, len2):
            d1 = D[i-1][j]+1 # deletion
            d2 = D[i][j-1]+1 # insertion
            d3 = 2 if str1[i-1]!=str2[j-1] else 0 # substitution if not equal, 0 else.
            D[i][j] = min([d1, d2, d3])
    return D[len1][len2]

One can interpret the update step as the cost of aligning the $i$th character of str1 to the $j$th character of str2. However, since the strings change all the time, it is not easy to understand the process. I think the step from D[i-1][j] to D[i][j] means when $i-1$ of str1 is aligned to $j$ of str2, we delete the $i$ of str1. The step from D[i][j-1] to D[i][j] means when $i$ of str1 is aligned to $j-1$ of str2, we insert a character so that the $j$ of str2 is aligned from the state that $i$ is aligned. The step from D[i-1][j-1] to D[i][j] means, $i-1$ and $j-1$ is aligned, we substitute $i$ and $j$ if they are different.

Note that even we say D[i][j] means the alignment of $i$ and $j$, there are three concrete conditions: $i$ is deleted, inserted a character for $j$, change the value of $i$ to the value of $j$. So it should be explained by "after modification of the $i$ in str1, the modified part of str1 is equal to the part of str2 from start to $j$".

1.2 Get the path

It is very similar to finding the hidden units of Markov Machine. We have to save the pointers from next state to last state during the update. For example, p[i][j]=(i-1,j) when d1 is the minimum.

1.3 Weighted MED

Replacing the weight of deletion, insertion and substitution by a weight related to the character. Those weights are used in both initialization and updating. The weights are computed from statistics. For spelling errors, one can create a table that counts the number of an editing action for each character (to each character).

2 Align gene sequence

Aligning gene sequence, searching overlap and max match pattern can be seen as three variations of the MED problem by modifying the initialization and updating values.

From the point of view of gene alignment, deletion can be seen as inserting a blank space behind the $j$ of str2 and insertion can be seen as a blank space being inserted to str1.

2.1 Test to myself

Can you remember and implement the algorithms used to solve the three questions mentioned in this section?