Bird
Raised Fist0
NLPml~12 mins

Edit distance (Levenshtein) in NLP - Model Pipeline Trace

Choose your learning style10 modes available

Start learning this pattern below

Jump into concepts and practice - no test required

or
Recommended
Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong
Model Pipeline - Edit distance (Levenshtein)

This pipeline calculates the edit distance between two words. It shows how many changes are needed to turn one word into another. This helps computers understand how similar two words are.

Data Flow - 5 Stages
1Input words
2 words (strings)Receive two words to compare2 words (strings)
"kitten" and "sitting"
2Preprocessing
2 words (strings)Convert words to lowercase and remove spaces2 cleaned words (strings)
"kitten" and "sitting"
3Distance matrix initialization
2 words with lengths m=6, n=7Create a matrix of size (m+1) x (n+1) to store distancesMatrix 7 rows x 8 columns
Matrix with first row 0 to 7 and first column 0 to 6
4Dynamic programming fill
Matrix 7 x 8Fill matrix with minimum edit operations (insert, delete, substitute)Completed matrix with edit distances
Final cell value is 3 for "kitten" vs "sitting"
5Result extraction
Completed matrixRead bottom-right cell for final edit distanceSingle integer value
Edit distance = 3
Training Trace - Epoch by Epoch
N/A
EpochLoss ↓Accuracy ↑Observation
1N/AN/AEdit distance calculation is not a training process but a deterministic algorithm.
Prediction Trace - 4 Layers
Layer 1: Initialize matrix
Layer 2: Fill matrix cell (1,1)
Layer 3: Fill matrix cell (3,4)
Layer 4: Complete matrix
Model Quiz - 3 Questions
Test your understanding
What does the bottom-right cell of the edit distance matrix represent?
AThe number of matching characters
BThe length of the longer word
CThe total number of edits needed to convert one word to another
DThe number of insertions only
Key Insight
Edit distance (Levenshtein) uses a matrix to count the smallest number of changes needed to turn one word into another. This helps machines understand word similarity by considering insertions, deletions, and substitutions.

Practice

(1/5)
1. What does the edit distance (Levenshtein distance) between two words measure?
easy
A. The length difference between two words
B. The minimum number of single-character edits to change one word into the other
C. The number of common letters between two words
D. The number of vowels in both words combined

Solution

  1. Step 1: Understand the definition of edit distance

    Edit distance counts how many changes like insertions, deletions, or substitutions are needed to convert one word into another.
  2. Step 2: Compare options with the definition

    Only The minimum number of single-character edits to change one word into the other correctly describes this minimum number of single-character edits.
  3. Final Answer:

    The minimum number of single-character edits to change one word into the other -> Option B
  4. Quick Check:

    Edit distance = minimum edits [OK]
Hint: Edit distance = smallest edits to match words [OK]
Common Mistakes:
  • Confusing edit distance with common letters count
  • Thinking it measures length difference only
  • Mixing up vowels or letter counts
2. Which of the following Python code snippets correctly initializes a 2D table for computing edit distance between strings s1 and s2 of lengths m and n?
easy
A. table = [[0] * (n + 1) for _ in range(m + 1)]
B. table = [[0] * m for _ in range(n)]
C. table = [[0] * (m + 1) for _ in range(n + 1)]
D. table = [[0] * n for _ in range(m)]

Solution

  1. Step 1: Recall table size for edit distance

    The table size must be (m+1) rows and (n+1) columns to include empty prefixes of both strings.
  2. Step 2: Match code to correct dimensions

    table = [[0] * (n + 1) for _ in range(m + 1)] creates a list with (m+1) rows and each row has (n+1) zeros, which is correct.
  3. Final Answer:

    table = [[0] * (n + 1) for _ in range(m + 1)] -> Option A
  4. Quick Check:

    Table size = (m+1) x (n+1) [OK]
Hint: Table size = (len(s1)+1) x (len(s2)+1) [OK]
Common Mistakes:
  • Swapping m and n in dimensions
  • Forgetting to add 1 for empty string prefix
  • Using wrong list comprehension order
3. What is the edit distance between the words "kitten" and "sitting"?
medium
A. 2
B. 4
C. 3
D. 5

Solution

  1. Step 1: Identify edits from "kitten" to "sitting"

    Changes: substitute 'k' -> 's', substitute 'e' -> 'i', insert 'g' at the end.
  2. Step 2: Count total edits

    There are 3 edits: 2 substitutions and 1 insertion.
  3. Final Answer:

    3 -> Option C
  4. Quick Check:

    Edits counted = 3 [OK]
Hint: Count substitutions + insertions + deletions [OK]
Common Mistakes:
  • Counting only substitutions
  • Missing the final insertion
  • Confusing similar letters as no change
4. Consider this Python code snippet for edit distance calculation. What is the error?
def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        table[i][0] = i
    for j in range(n + 1):
        table[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i] == s2[j]:
                cost = 0
            else:
                cost = 1
            table[i][j] = min(table[i-1][j] + 1, table[i][j-1] + 1, table[i-1][j-1] + cost)
    return table[m][n]
medium
A. The return statement should be table[0][0]
B. The table initialization is incorrect
C. The cost should be 2 when characters differ
D. Indexing s1[i] and s2[j] causes an off-by-one error

Solution

  1. Step 1: Check string indexing in loops

    Strings are 0-indexed, but loops start at 1, so s1[i] and s2[j] skip first character and cause index errors.
  2. Step 2: Correct indexing

    Use s1[i-1] and s2[j-1] to access correct characters for comparison.
  3. Final Answer:

    Indexing s1[i] and s2[j] causes an off-by-one error -> Option D
  4. Quick Check:

    Indexing error = s1[i-1], s2[j-1] needed [OK]
Hint: Remember strings start at index 0, loops at 1 need offset [OK]
Common Mistakes:
  • Using s1[i] instead of s1[i-1]
  • Thinking cost must be 2 for difference
  • Returning wrong table cell
5. You want to find the closest word to "flame" from the list ["frame", "flan", "flame", "blame"] using edit distance. Which word will the algorithm select as closest?
hard
A. "flame"
B. "frame"
C. "blame"
D. "flan"

Solution

  1. Step 1: Calculate edit distances to each word

    Distances: "flame" to "frame" = 1 (substitute 'l'->'r'), to "flan" = 2 (delete 'm' and 'e'), to "blame" = 1 (substitute 'f'->'b'), to "flame" = 0 (same word).
  2. Step 2: Identify minimum distance

    The smallest distance is 0 for "flame" itself, meaning exact match.
  3. Final Answer:

    "flame" -> Option A
  4. Quick Check:

    Exact match distance = 0 [OK]
Hint: Exact match has zero edit distance [OK]
Common Mistakes:
  • Choosing word with one letter difference over exact match
  • Ignoring exact matches
  • Miscounting substitutions