Levenshtein Distance is also called string edit distance. It is a dynamic programming algorithm that finds the number of segments that differ between two strings. In linguistics, it was first used in dialectometry by Kessler (1995) and was later extended by Nerbonne & Wiersma (1997) to use features. Later, Wiersma (2004) extended it to measurements of phonetic correlates such as F2 and F3.
The Wikipedia page on Levenshtein distance gives imperative pseudocode for the algorithm and explains some of the general uses of the algorithm. The functional version of the algorithm is
levenshtein :: String -> String -> (Char -> Int, Char -> Int, Char -> Char -> Int) -> Int
levenshtein s t (ins,del,sub) = lev s t
where lev "" "" = 0
lev s "" = sum ins s
lev "" t = sum del t
lev (c:s) (d:t) = min [ins c + lev s (d:t),
sub c d + lev s t,
del d + lev (c:s) t]Note: lev must be memoised for efficient execution. Here is a version in Python that uses dynamic programming instead of memoisation.
1 def levenshtein(s1,s2,(delete,insert,subst)):
2 table = [range(len(s2)+1)]
3 for i,c1 in enumerate(s1):
4 row = [i+1]
5 for j,c2 in enumerate(s2):
6 row.append(min([table[-1][j+1]+delete(c1),
7 table[-1][j]+subst(c1,c2),
8 row[-1]+insert(c2)]))
9 table.append(row)
10 return table
Here is an efficient functional version that retains only the current and previous rows of the dynamic programming table in memory. It returns only the answer.
1 def levenshtein(s,t,(ins,delt,sub)):
2 def row(prev_row, (i,d)):
3 def best(current_row, ((n_1,n),c)):
4 m_1 = current_row[-1]
5 return current_row + [min(ins(c)+m_1, sub(c,d)+n_1, delt(d)+n)]
6 return reduce(best, zip(window(prev_row, 2), s), [i+1])
7 return reduce(row, enumerate(t), range(len(s)+1))[-1]
8
9 #given the function window:
10 def window(l, n, inc=1):
11 return [l[j:j+n] for j in range(0,len(l),inc) if j+n<=len(l)]
Levenshtein distance has been used to obtain distances between speakers of Irish Gaelic (Kessler), British English (Shackleton), Norwegian (Gooskens), Dutch (Nerbonne and Heeringa) and Bulgarian (Prokic). There are probably others; I have used it to measure the distance between the speech of cochlear implant users who are native speakers of American English. NathanSanders
