Edit distance helps us find how different two words are by counting the smallest number of changes needed to turn one word into the other.
Edit distance (Levenshtein) in NLP
Start learning this pattern below
Jump into concepts and practice - no test required
def levenshtein_distance(word1: str, word2: str) -> int: m, n = len(word1), len(word2) dp = [[0] * (n + 1) for _ in range(m + 1)] for i in range(m + 1): dp[i][0] = i for j in range(n + 1): dp[0][j] = j for i in range(1, m + 1): for j in range(1, n + 1): if word1[i - 1] == word2[j - 1]: dp[i][j] = dp[i - 1][j - 1] else: dp[i][j] = 1 + min(dp[i - 1][j], # deletion dp[i][j - 1], # insertion dp[i - 1][j - 1]) # substitution return dp[m][n]
The function uses a table (dp) to store results of smaller problems.
Each cell dp[i][j] shows the edit distance between first i letters of word1 and first j letters of word2.
distance = levenshtein_distance("cat", "cut") print(distance) # Output: 1
distance = levenshtein_distance("", "dog") print(distance) # Output: 3
distance = levenshtein_distance("book", "books") print(distance) # Output: 1
distance = levenshtein_distance("apple", "apple") print(distance) # Output: 0
This program calculates the edit distance between 'kitten' and 'sitting'. It prints the number of changes needed to turn one word into the other.
def levenshtein_distance(word1: str, word2: str) -> int: m, n = len(word1), len(word2) dp = [[0] * (n + 1) for _ in range(m + 1)] for i in range(m + 1): dp[i][0] = i for j in range(n + 1): dp[0][j] = j for i in range(1, m + 1): for j in range(1, n + 1): if word1[i - 1] == word2[j - 1]: dp[i][j] = dp[i - 1][j - 1] else: dp[i][j] = 1 + min(dp[i - 1][j], # deletion dp[i][j - 1], # insertion dp[i - 1][j - 1]) # substitution return dp[m][n] # Words to compare word_a = "kitten" word_b = "sitting" print(f"Edit distance between '{word_a}' and '{word_b}':") distance = levenshtein_distance(word_a, word_b) print(distance)
Time complexity is O(m*n), where m and n are the lengths of the two words.
Space complexity is also O(m*n) because of the table used.
Common mistake: forgetting to initialize the first row and column of the table, which represent empty strings.
Use edit distance when you want to measure how similar two strings are. For very long strings, consider faster approximate methods.
Edit distance counts the smallest number of changes to turn one word into another.
It helps in spell checking, search, and text comparison.
The dynamic programming method uses a table to build the answer step-by-step.
Practice
Solution
Step 1: Understand the definition of edit distance
Edit distance counts how many changes like insertions, deletions, or substitutions are needed to convert one word into another.Step 2: Compare options with the definition
Only The minimum number of single-character edits to change one word into the other correctly describes this minimum number of single-character edits.Final Answer:
The minimum number of single-character edits to change one word into the other -> Option BQuick Check:
Edit distance = minimum edits [OK]
- Confusing edit distance with common letters count
- Thinking it measures length difference only
- Mixing up vowels or letter counts
s1 and s2 of lengths m and n?Solution
Step 1: Recall table size for edit distance
The table size must be (m+1) rows and (n+1) columns to include empty prefixes of both strings.Step 2: Match code to correct dimensions
table = [[0] * (n + 1) for _ in range(m + 1)] creates a list with (m+1) rows and each row has (n+1) zeros, which is correct.Final Answer:
table = [[0] * (n + 1) for _ in range(m + 1)] -> Option AQuick Check:
Table size = (m+1) x (n+1) [OK]
- Swapping m and n in dimensions
- Forgetting to add 1 for empty string prefix
- Using wrong list comprehension order
"kitten" and "sitting"?Solution
Step 1: Identify edits from "kitten" to "sitting"
Changes: substitute 'k' -> 's', substitute 'e' -> 'i', insert 'g' at the end.Step 2: Count total edits
There are 3 edits: 2 substitutions and 1 insertion.Final Answer:
3 -> Option CQuick Check:
Edits counted = 3 [OK]
- Counting only substitutions
- Missing the final insertion
- Confusing similar letters as no change
def edit_distance(s1, s2):
m, n = len(s1), len(s2)
table = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(m + 1):
table[i][0] = i
for j in range(n + 1):
table[0][j] = j
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i] == s2[j]:
cost = 0
else:
cost = 1
table[i][j] = min(table[i-1][j] + 1, table[i][j-1] + 1, table[i-1][j-1] + cost)
return table[m][n]Solution
Step 1: Check string indexing in loops
Strings are 0-indexed, but loops start at 1, so s1[i] and s2[j] skip first character and cause index errors.Step 2: Correct indexing
Use s1[i-1] and s2[j-1] to access correct characters for comparison.Final Answer:
Indexing s1[i] and s2[j] causes an off-by-one error -> Option DQuick Check:
Indexing error = s1[i-1], s2[j-1] needed [OK]
- Using s1[i] instead of s1[i-1]
- Thinking cost must be 2 for difference
- Returning wrong table cell
"flame" from the list ["frame", "flan", "flame", "blame"] using edit distance. Which word will the algorithm select as closest?Solution
Step 1: Calculate edit distances to each word
Distances: "flame" to "frame" = 1 (substitute 'l'->'r'), to "flan" = 2 (delete 'm' and 'e'), to "blame" = 1 (substitute 'f'->'b'), to "flame" = 0 (same word).Step 2: Identify minimum distance
The smallest distance is 0 for "flame" itself, meaning exact match.Final Answer:
"flame" -> Option AQuick Check:
Exact match distance = 0 [OK]
- Choosing word with one letter difference over exact match
- Ignoring exact matches
- Miscounting substitutions
