
How to Use str.extract in pandas for Text Extraction

Use str.extract in pandas to pull specific parts of text from a column by applying a regular expression pattern with capture groups. By default it returns a DataFrame with one column per capture group, taken from the first match in each string, making it easy to separate or analyze text data.

Syntax

The basic syntax of str.extract is:

  • Series.str.extract(pat, flags=0, expand=True)

Where:

  • pat is the regular expression pattern with capture groups () to extract.
  • flags are optional regex flags like re.IGNORECASE.
  • expand controls output format: True (the default) always returns a DataFrame; False returns a Series when the pattern has one capture group and a DataFrame when it has several.
python
Series.str.extract(pat, flags=0, expand=True)
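As a rough sketch of how flags and expand change the result, the snippet below runs the same kind of pattern three ways. The sample strings and variable names are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical sample data with mixed-case prefixes
codes = pd.Series(['ID-101', 'id-202', 'Id-303'])

# One capture group, expand=True (default): a one-column DataFrame
as_frame = codes.str.extract(r'id-(\d+)', flags=re.IGNORECASE)

# One capture group, expand=False: a Series
as_series = codes.str.extract(r'id-(\d+)', flags=re.IGNORECASE, expand=False)

# Two capture groups, expand=False: still a DataFrame
two_groups = codes.str.extract(r'(id)-(\d+)', flags=re.IGNORECASE, expand=False)

print(type(as_frame).__name__)    # DataFrame
print(type(as_series).__name__)   # Series
print(type(two_groups).__name__)  # DataFrame
print(as_series.tolist())         # ['101', '202', '303']
```

Without re.IGNORECASE, only the lowercase 'id-202' entry would match this pattern.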

Example

This example shows how to extract the area code from phone numbers in a pandas Series using str.extract with a regex pattern.

python
import pandas as pd

# Sample data
phones = pd.Series(['123-456-7890', '987-654-3210', '555-123-4567'])

# Extract area code (first 3 digits) using regex group
area_codes = phones.str.extract(r'(\d{3})')

print(area_codes)
Output
     0
0  123
1  987
2  555
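The same phone numbers can also be split into several columns at once. The sketch below uses named capture groups; the group names area, exchange, and line are chosen here for illustration and are not part of the original example.

```python
import pandas as pd

phones = pd.Series(['123-456-7890', '987-654-3210', '555-123-4567'])

# Named groups (?P<name>...) become the column names of the result
parts = phones.str.extract(r'(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})')

print(parts)
```

The extracted values are strings, so convert them (for example with astype(int)) if you need numbers.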

Common Pitfalls

Common mistakes when using str.extract include:

  • Not using parentheses () in the regex pattern to define capture groups. pandas raises a ValueError in this case instead of extracting anything.
  • Expecting a Series output when expand=True returns a DataFrame.
  • Using patterns that do not match the string format, resulting in NaN values.
python
import pandas as pd

phones = pd.Series(['123-456-7890', '987-654-3210'])

# Wrong: no capture group, so pandas raises ValueError
try:
    phones.str.extract(r'\d{3}')
except ValueError:
    print('Wrong: ValueError raised')

# Right: with capture group
right = phones.str.extract(r'(\d{3})')

print('\nRight output:')
print(right)
Output
Wrong: ValueError raised

Right output:
     0
0  123
1  987
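The last pitfall, a pattern that does not match some rows, can be sketched as follows. The 'unknown' entry and the 'missing' placeholder are assumptions made up for this illustration.

```python
import pandas as pd

# Hypothetical data: the second entry contains no digits at all
phones = pd.Series(['123-456-7890', 'unknown', '555-123-4567'])

# expand=False with one capture group gives a Series
area = phones.str.extract(r'(\d{3})-', expand=False)

# Rows where the pattern did not match hold a missing value
print(area.isna().tolist())             # [False, True, False]

# One possible way to handle them: substitute a placeholder
print(area.fillna('missing').tolist())  # ['123', 'missing', '555']
```

Checking isna() after extraction is a quick way to find rows whose format differs from what the pattern expects.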

Quick Reference

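| Parameter | Description |
|-----------|-------------|
| pat | Regex pattern with at least one capture group to extract text |
| flags | Regex flags from the re module, like re.IGNORECASE (default 0, meaning no flags) |
| expand | If True (default), returns a DataFrame; if False, returns a Series for one capture group and a DataFrame for several |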

Key Takeaways

  • Use str.extract with regex capture groups to pull parts of strings in pandas columns.
  • Always include parentheses in your regex pattern to define what to extract; a pattern without capture groups raises a ValueError.
  • The output is a DataFrame by default; with a single capture group, set expand=False to get a Series.
  • If the pattern does not match a string, the result contains NaN for that row.
  • str.extract is well suited to splitting or cleaning text data in pandas.