0
0
NLPml~5 mins

Edit distance (Levenshtein) in NLP

Choose your learning style9 modes available
Introduction

Edit distance helps us find how different two words are by counting the smallest number of changes needed to turn one word into the other.

Checking spelling mistakes and suggesting corrections.
Comparing two names to see if they might be the same person.
Finding similar words in search engines.
Matching user input with stored commands or keywords.
Detecting plagiarism by comparing text similarity.
Syntax
NLP
def levenshtein_distance(word1: str, word2: str) -> int:
    m, n = len(word1), len(word2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if word1[i - 1] == word2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],    # deletion
                                   dp[i][j - 1],    # insertion
                                   dp[i - 1][j - 1]) # substitution
    return dp[m][n]

The function uses a table (dp) to store results of smaller problems.

Each cell dp[i][j] shows the edit distance between first i letters of word1 and first j letters of word2.

Examples
Only one substitution needed: 'a' to 'u'.
NLP
distance = levenshtein_distance("cat", "cut")
print(distance)  # Output: 1
Empty string to 'dog' needs 3 insertions.
NLP
distance = levenshtein_distance("", "dog")
print(distance)  # Output: 3
One insertion of 's' at the end.
NLP
distance = levenshtein_distance("book", "books")
print(distance)  # Output: 1
Both words are the same, so no changes needed.
NLP
distance = levenshtein_distance("apple", "apple")
print(distance)  # Output: 0
Sample Model

This program calculates the edit distance between 'kitten' and 'sitting'. It prints the number of changes needed to turn one word into the other.

NLP
def levenshtein_distance(word1: str, word2: str) -> int:
    m, n = len(word1), len(word2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if word1[i - 1] == word2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],    # deletion
                                   dp[i][j - 1],    # insertion
                                   dp[i - 1][j - 1]) # substitution
    return dp[m][n]

# Words to compare
word_a = "kitten"
word_b = "sitting"

print(f"Edit distance between '{word_a}' and '{word_b}':")
distance = levenshtein_distance(word_a, word_b)
print(distance)
OutputSuccess
Important Notes

Time complexity is O(m*n), where m and n are the lengths of the two words.

Space complexity is also O(m*n) because of the table used.

Common mistake: forgetting to initialize the first row and column of the table, which represent empty strings.

Use edit distance when you want to measure how similar two strings are. For very long strings, consider faster approximate methods.

Summary

Edit distance counts the smallest number of changes to turn one word into another.

It helps in spell checking, search, and text comparison.

The dynamic programming method uses a table to build the answer step-by-step.