Minimun Edit Distance
Table of Contents
1 Edit distance from string to string
The edit distance from string str1
to string str2
is used to
measure the cost of modifying str1
to make it equals to str2
. It
is used in spelling correction and coreference of named entities such
as "U.S.A" to "USA".
Define three edit actions: insertion, deletion and substitution; the edit distance is the weighted sum of number of operations. The costs of insertion and deletion are usually set to one while the cost of substitution is set to two since the substitution can be thought as a combination of deletion and insertion.
1.1 Min Edit Distance (MED)
Min Edit Distance (MED) is computed through a dynamic programming process, which is very similar to computing the maximum probability of Hidden Markov model. MED is sometimes called Levenshtein distance. See below the algorithm used to compute the MED.
def med(str1, str2): len1, len2 = len(str1)+1, len(str2)+1 # init the distance table D=[[0]*len2 for i in range(len1)] D[0]=range(len2) for i in range(len1): D[i][0]=i for i in range(1, len1): for j in range(1, len2): d1 = D[i-1][j]+1 # deletion d2 = D[i][j-1]+1 # insertion d3 = 2 if str1[i-1]!=str2[j-1] else 0 # substitution if not equal, 0 else. D[i][j] = min([d1, d2, d3]) return D[len1][len2]
One can interpret the update step as the cost of aligning the $i$th
character of str1
to the $j$th character of str2
. However, since
the strings change all the time, it is not easy to understand the
process. I think the step from D[i-1][j]
to D[i][j]
means when
\(i-1\) of str1
is aligned to \(j\) of str2
, we delete the \(i\) of
str1
. The step from D[i][j-1]
to D[i][j]
means when \(i\) of
str1
is aligned to \(j-1\) of str2
, we insert a character so that
the \(j\) of str2
is aligned from the state that \(i\) is aligned. The
step from D[i-1][j-1]
to D[i][j]
means, \(i-1\) and \(j-1\) is
aligned, we substitute \(i\) and \(j\) if they are different.
Note that even we say D[i][j]
means the alignment of \(i\) and \(j\),
there are three concrete conditions: \(i\) is deleted, inserted a
character for \(j\), change the value of \(i\) to the value of \(j\). So it
should be explained by "after modification of the \(i\) in str1
, the
modified part of str1
is equal to the part of str2
from start to
\(j\)".
1.2 Get the path
It is very similar to finding the hidden units of Markov Machine. We
have to save the pointers from next state to last state during the
update. For example, p[i][j]=(i-1,j)
when d1
is the minimum.
1.3 Weighted MED
Replacing the weight of deletion, insertion and substitution by a weight related to the character. Those weights are used in both initialization and updating. The weights are computed from statistics. For spelling errors, one can create a table that counts the number of an editing action for each character (to each character).
2 Align gene sequence
Aligning gene sequence, searching overlap and max match pattern can be seen as three variations of the MED problem by modifying the initialization and updating values.
From the point of view of gene alignment, deletion can be seen as
inserting a blank space behind the \(j\) of str2
and insertion can be
seen as a blank space being inserted to str1
.
2.1 Test to myself
Can you remember and implement the algorithms used to solve the three questions mentioned in this section?