Bird
Raised Fist0
NLPml~20 mins

Edit distance (Levenshtein) in NLP - Practice Problems & Coding Challenges

Choose your learning style10 modes available

Start learning this pattern below

Jump into concepts and practice - no test required

or
Recommended
Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong
Challenge - 5 Problems
🎖️
Edit Distance Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
Predict Output
intermediate
1:30remaining
Output of Levenshtein distance function
What is the output of this Python function call?

```python def levenshtein(s1, s2): if len(s1) < len(s2): return levenshtein(s2, s1) if len(s2) == 0: return len(s1) previous_row = range(len(s2) + 1) for i, c1 in enumerate(s1): current_row = [i + 1] for j, c2 in enumerate(s2): insertions = previous_row[j + 1] + 1 deletions = current_row[j] + 1 substitutions = previous_row[j] + (c1 != c2) current_row.append(min(insertions, deletions, substitutions)) previous_row = current_row return previous_row[-1] print(levenshtein("kitten", "sitting")) ```
NLP
def levenshtein(s1, s2):
    if len(s1) < len(s2):
        return levenshtein(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

print(levenshtein("kitten", "sitting"))
A4
B6
C5
D3
Attempts:
2 left
💡 Hint
Count the minimum edits needed to change 'kitten' into 'sitting'.
🧠 Conceptual
intermediate
1:00remaining
Understanding Levenshtein distance properties
Which of the following statements about Levenshtein distance is TRUE?
ALevenshtein distance is always equal to the length difference between two strings.
BLevenshtein distance is a measure of semantic similarity between words.
CLevenshtein distance can only be zero or positive, never negative.
DLevenshtein distance counts only substitutions, ignoring insertions and deletions.
Attempts:
2 left
💡 Hint
Think about what the distance measures and if it can be negative.
Hyperparameter
advanced
1:30remaining
Choosing weights in weighted Levenshtein distance
In a weighted Levenshtein distance, you assign different costs to insertions, deletions, and substitutions. If you want to penalize substitutions more than insertions and deletions, which weight setting is correct?
AInsertion=2, Deletion=2, Substitution=2
BInsertion=1, Deletion=1, Substitution=3
CInsertion=3, Deletion=3, Substitution=1
DInsertion=0, Deletion=0, Substitution=5
Attempts:
2 left
💡 Hint
Higher cost means more penalty.
🔧 Debug
advanced
1:30remaining
Identify the error in Levenshtein distance code
What error will this code raise when called with levenshtein('a', 'b')?

```python def levenshtein(s1, s2): if len(s1) < len(s2): return levenshtein(s2, s1) if len(s2) == 0: return len(s1) previous_row = [0] * len(s2) for i, c1 in enumerate(s1): current_row = [i + 1] for j, c2 in enumerate(s2): insertions = previous_row[j + 1] + 1 deletions = current_row[j] + 1 substitutions = previous_row[j] + (c1 != c2) current_row.append(min(insertions, deletions, substitutions)) previous_row = current_row return previous_row[-1] ```
NLP
def levenshtein(s1, s2):
    if len(s1) < len(s2):
        return levenshtein(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = [0] * len(s2)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

levenshtein('a', 'b')
AIndexError: list index out of range
BTypeError: unsupported operand type(s) for +: 'int' and 'str'
CReturns 0 incorrectly
DNo error, returns 1
Attempts:
2 left
💡 Hint
Check the length of previous_row and how indexes are accessed.
Model Choice
expert
2:00remaining
Best model for approximate string matching using edit distance
You want to build a system that suggests corrections for misspelled words by finding dictionary words with smallest edit distance. Which model or approach is best suited for this task?
AUse a BK-tree data structure with Levenshtein distance for fast approximate matching
BTrain a deep neural network to classify words as correct or incorrect spellings
CUse a bag-of-words model with cosine similarity to find closest words
DApply k-means clustering on word embeddings to group similar words
Attempts:
2 left
💡 Hint
Think about efficient search for words within a certain edit distance.

Practice

(1/5)
1. What does the edit distance (Levenshtein distance) between two words measure?
easy
A. The length difference between two words
B. The minimum number of single-character edits to change one word into the other
C. The number of common letters between two words
D. The number of vowels in both words combined

Solution

  1. Step 1: Understand the definition of edit distance

    Edit distance counts how many changes like insertions, deletions, or substitutions are needed to convert one word into another.
  2. Step 2: Compare options with the definition

    Only The minimum number of single-character edits to change one word into the other correctly describes this minimum number of single-character edits.
  3. Final Answer:

    The minimum number of single-character edits to change one word into the other -> Option B
  4. Quick Check:

    Edit distance = minimum edits [OK]
Hint: Edit distance = smallest edits to match words [OK]
Common Mistakes:
  • Confusing edit distance with common letters count
  • Thinking it measures length difference only
  • Mixing up vowels or letter counts
2. Which of the following Python code snippets correctly initializes a 2D table for computing edit distance between strings s1 and s2 of lengths m and n?
easy
A. table = [[0] * (n + 1) for _ in range(m + 1)]
B. table = [[0] * m for _ in range(n)]
C. table = [[0] * (m + 1) for _ in range(n + 1)]
D. table = [[0] * n for _ in range(m)]

Solution

  1. Step 1: Recall table size for edit distance

    The table size must be (m+1) rows and (n+1) columns to include empty prefixes of both strings.
  2. Step 2: Match code to correct dimensions

    table = [[0] * (n + 1) for _ in range(m + 1)] creates a list with (m+1) rows and each row has (n+1) zeros, which is correct.
  3. Final Answer:

    table = [[0] * (n + 1) for _ in range(m + 1)] -> Option A
  4. Quick Check:

    Table size = (m+1) x (n+1) [OK]
Hint: Table size = (len(s1)+1) x (len(s2)+1) [OK]
Common Mistakes:
  • Swapping m and n in dimensions
  • Forgetting to add 1 for empty string prefix
  • Using wrong list comprehension order
3. What is the edit distance between the words "kitten" and "sitting"?
medium
A. 2
B. 4
C. 3
D. 5

Solution

  1. Step 1: Identify edits from "kitten" to "sitting"

    Changes: substitute 'k' -> 's', substitute 'e' -> 'i', insert 'g' at the end.
  2. Step 2: Count total edits

    There are 3 edits: 2 substitutions and 1 insertion.
  3. Final Answer:

    3 -> Option C
  4. Quick Check:

    Edits counted = 3 [OK]
Hint: Count substitutions + insertions + deletions [OK]
Common Mistakes:
  • Counting only substitutions
  • Missing the final insertion
  • Confusing similar letters as no change
4. Consider this Python code snippet for edit distance calculation. What is the error?
def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        table[i][0] = i
    for j in range(n + 1):
        table[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i] == s2[j]:
                cost = 0
            else:
                cost = 1
            table[i][j] = min(table[i-1][j] + 1, table[i][j-1] + 1, table[i-1][j-1] + cost)
    return table[m][n]
medium
A. The return statement should be table[0][0]
B. The table initialization is incorrect
C. The cost should be 2 when characters differ
D. Indexing s1[i] and s2[j] causes an off-by-one error

Solution

  1. Step 1: Check string indexing in loops

    Strings are 0-indexed, but loops start at 1, so s1[i] and s2[j] skip first character and cause index errors.
  2. Step 2: Correct indexing

    Use s1[i-1] and s2[j-1] to access correct characters for comparison.
  3. Final Answer:

    Indexing s1[i] and s2[j] causes an off-by-one error -> Option D
  4. Quick Check:

    Indexing error = s1[i-1], s2[j-1] needed [OK]
Hint: Remember strings start at index 0, loops at 1 need offset [OK]
Common Mistakes:
  • Using s1[i] instead of s1[i-1]
  • Thinking cost must be 2 for difference
  • Returning wrong table cell
5. You want to find the closest word to "flame" from the list ["frame", "flan", "flame", "blame"] using edit distance. Which word will the algorithm select as closest?
hard
A. "flame"
B. "frame"
C. "blame"
D. "flan"

Solution

  1. Step 1: Calculate edit distances to each word

    Distances: "flame" to "frame" = 1 (substitute 'l'->'r'), to "flan" = 2 (delete 'm' and 'e'), to "blame" = 1 (substitute 'f'->'b'), to "flame" = 0 (same word).
  2. Step 2: Identify minimum distance

    The smallest distance is 0 for "flame" itself, meaning exact match.
  3. Final Answer:

    "flame" -> Option A
  4. Quick Check:

    Exact match distance = 0 [OK]
Hint: Exact match has zero edit distance [OK]
Common Mistakes:
  • Choosing word with one letter difference over exact match
  • Ignoring exact matches
  • Miscounting substitutions