# Minimun Edit Distance

## Table of Contents

## 1 Edit distance from string to string

The edit distance from string `str1`

to string `str2`

is used to
measure the cost of modifying `str1`

to make it equals to `str2`

. It
is used in spelling correction and coreference of named entities such
as "U.S.A" to "USA".

Define three edit actions: insertion, deletion and substitution; the edit distance is the weighted sum of number of operations. The costs of insertion and deletion are usually set to one while the cost of substitution is set to two since the substitution can be thought as a combination of deletion and insertion.

### 1.1 Min Edit Distance (MED)

Min Edit Distance (MED) is computed through a dynamic programming process, which is very similar to computing the maximum probability of Hidden Markov model. MED is sometimes called Levenshtein distance. See below the algorithm used to compute the MED.

def med(str1, str2): len1, len2 = len(str1)+1, len(str2)+1 # init the distance table D=[[0]*len2 for i in range(len1)] D[0]=range(len2) for i in range(len1): D[i][0]=i for i in range(1, len1): for j in range(1, len2): d1 = D[i-1][j]+1 # deletion d2 = D[i][j-1]+1 # insertion d3 = 2 if str1[i-1]!=str2[j-1] else 0 # substitution if not equal, 0 else. D[i][j] = min([d1, d2, d3]) return D[len1][len2]

One can interpret the update step as the cost of aligning the $i$th
character of `str1`

to the $j$th character of `str2`

. However, since
the strings change all the time, it is not easy to understand the
process. I think the step from `D[i-1][j]`

to `D[i][j]`

means when
\(i-1\) of `str1`

is aligned to \(j\) of `str2`

, we delete the \(i\) of
`str1`

. The step from `D[i][j-1]`

to `D[i][j]`

means when \(i\) of
`str1`

is aligned to \(j-1\) of `str2`

, we insert a character so that
the \(j\) of `str2`

is aligned from the state that \(i\) is aligned. The
step from `D[i-1][j-1]`

to `D[i][j]`

means, \(i-1\) and \(j-1\) is
aligned, we substitute \(i\) and \(j\) if they are different.

Note that even we say `D[i][j]`

means the alignment of \(i\) and \(j\),
there are three concrete conditions: \(i\) is deleted, inserted a
character for \(j\), change the value of \(i\) to the value of \(j\). So it
should be explained by "after modification of the \(i\) in `str1`

, the
modified part of `str1`

is equal to the part of `str2`

from start to
\(j\)".

### 1.2 Get the path

It is very similar to finding the hidden units of Markov Machine. We
have to save the pointers from next state to last state during the
update. For example, `p[i][j]=(i-1,j)`

when `d1`

is the minimum.

### 1.3 Weighted MED

Replacing the weight of deletion, insertion and substitution by a weight related to the character. Those weights are used in both initialization and updating. The weights are computed from statistics. For spelling errors, one can create a table that counts the number of an editing action for each character (to each character).

## 2 Align gene sequence

Aligning gene sequence, searching overlap and max match pattern can be seen as three variations of the MED problem by modifying the initialization and updating values.

From the point of view of gene alignment, deletion can be seen as
inserting a blank space behind the \(j\) of `str2`

and insertion can be
seen as a blank space being inserted to `str1`

.

### 2.1 Test to myself

Can you remember and implement the algorithms used to solve the three questions mentioned in this section?