Bird
Raised Fist0
NlpHow-ToBeginner ยท 3 min read

How to Tokenize Using Regex in NLP: Simple Guide

To tokenize text using regex in NLP, you define a pattern that matches words or tokens and use functions like re.findall() in Python to extract them. This method splits text based on custom rules, allowing flexible tokenization beyond simple spaces.
๐Ÿ“

Syntax

Use the re.findall(pattern, text) function where:

  • pattern is a regex string defining what a token looks like (e.g., words, numbers).
  • text is the input string to tokenize.

This returns a list of all matches as tokens.

python
import re
pattern = r"\w+"
text = "Hello, world! Let's tokenize this text."
tokens = re.findall(pattern, text)
print(tokens)
Output
['Hello', 'world', 'Let', 's', 'tokenize', 'this', 'text']
๐Ÿ’ป

Example

This example shows how to tokenize a sentence into words using regex that matches sequences of letters and apostrophes to keep contractions intact.

python
import re
text = "Don't split contractions like don't or it's."
pattern = r"[A-Za-z']+"
tokens = re.findall(pattern, text)
print(tokens)
Output
["Don't", 'split', 'contractions', 'like', "don't", 'or', "it's"]
โš ๏ธ

Common Pitfalls

Common mistakes include:

  • Using a pattern that splits contractions or hyphenated words unintentionally.
  • Ignoring punctuation that should be separate tokens.
  • Not accounting for numbers or special characters if needed.

Always test your regex on sample text to ensure tokens match your needs.

python
import re
text = "It's 3:00 p.m. - time to tokenize!"

# Wrong pattern splits contractions and punctuation badly
wrong_pattern = r"\w+"
print(re.findall(wrong_pattern, text))

# Better pattern keeps contractions and separates punctuation
better_pattern = r"\b\w+'?\w*\b|[.,!?-]"
print(re.findall(better_pattern, text))
Output
['It', 's', '3', '00', 'p', 'm', 'time', 'to', 'tokenize'] ["It's", '3', '00', 'p', 'm', '.', '-', 'time', 'to', 'tokenize', '!']
๐Ÿ“Š

Quick Reference

Regex patterns for common token types:

  • \w+: Matches words (letters, digits, underscore).
  • [A-Za-z']+: Matches words with letters and apostrophes (for contractions).
  • \b\w+'?\w*\b: Matches words with optional apostrophes inside.
  • [.,!?-]: Matches common punctuation as separate tokens.
โœ…

Key Takeaways

Use re.findall() with a regex pattern to extract tokens from text.
Design regex patterns carefully to handle contractions, punctuation, and special cases.
Test your regex on sample text to avoid splitting tokens incorrectly.
Regex tokenization is flexible but requires tuning for your specific text data.