How to Use str.extract in pandas for Text Extraction
Use str.extract in pandas to pull specific parts of text from a column by applying a regular expression pattern. By default it returns a DataFrame with one column per capture group, holding the first match found in each string, which makes it easy to separate or analyze text data.
Syntax
The basic syntax of str.extract is:
Series.str.extract(pat, flags=0, expand=True)
Where:
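- pat is the regular expression pattern with capture groups () to extract.
- flags are optional regex flags like re.IGNORECASE.
- expand controls the output format: True returns a DataFrame with one column per capture group; False returns a Series when the pattern has a single capture group and a DataFrame when it has several.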
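As a rough illustration of how flags and expand change the result, here is a minimal sketch. The sample strings and the variable names (codes, as_frame, as_series, two_groups) are made up for this example and are not part of the pandas API.

```python
import re
import pandas as pd

# Hypothetical sample data, invented for this sketch
codes = pd.Series(['id: AB12', 'ID: cd34', 'no code here'])

# flags=re.IGNORECASE lets the pattern match 'id' and 'ID' alike
as_frame = codes.str.extract(r'id: ([a-z]{2})', flags=re.IGNORECASE)

# expand=False with a single capture group gives a Series
as_series = codes.str.extract(r'id: ([a-z]{2})', flags=re.IGNORECASE, expand=False)

# expand=False with two capture groups still gives a DataFrame
two_groups = codes.str.extract(r'id: ([a-z]{2})(\d{2})', flags=re.IGNORECASE, expand=False)

print(type(as_frame).__name__)
print(type(as_series).__name__)
print(type(two_groups).__name__)
```

The third string has no match, so its row holds NaN in every result.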
Example
This example shows how to extract the area code from phone numbers in a pandas Series using str.extract with a regex pattern.
python
import pandas as pd

# Sample data
phones = pd.Series(['123-456-7890', '987-654-3210', '555-123-4567'])

# Extract area code (first 3 digits) using regex group
area_codes = phones.str.extract(r'(\d{3})')
print(area_codes)
Output
     0
0  123
1  987
2  555
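The same phone numbers can also be split into several columns at once. The sketch below assumes every number follows the NNN-NNN-NNNN format. The group names area, exchange, and line are chosen only for illustration; pandas uses named groups as column names.

```python
import pandas as pd

phones = pd.Series(['123-456-7890', '987-654-3210', '555-123-4567'])

# Each named capture group becomes one column in the result
parts = phones.str.extract(r'(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})')
print(parts)
```

Unnamed groups would instead produce columns numbered 0, 1, 2.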
Common Pitfalls
Common mistakes when using str.extract include:
- Not using parentheses () in the regex pattern to define capture groups. pandas then raises a ValueError because the pattern contains no capture groups.
- Expecting a Series output when the default expand=True returns a DataFrame.
- Using patterns that do not match the string format, resulting in NaN values.
python
import pandas as pd

phones = pd.Series(['123-456-7890', '987-654-3210'])

# Wrong: no capture group, so pandas raises ValueError
try:
    wrong = phones.str.extract(r'\d{3}')
except ValueError as err:
    wrong = err

# Right: with capture group
right = phones.str.extract(r'(\d{3})')

print('Wrong output:')
print(wrong)
print('\nRight output:')
print(right)
Output
Wrong output:
pattern contains no capture groups

Right output:
     0
0  123
1  987
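The third pitfall, rows that do not match, can be handled after extraction. This is a minimal sketch with made-up data; the names area and matched are illustrative, and dropping the missing rows is only one possible way to deal with them.

```python
import pandas as pd

# Hypothetical data: the second entry does not follow the phone format
phones = pd.Series(['123-456-7890', 'unknown', '555-123-4567'])

# Non-matching rows come back as NaN instead of raising an error
area = phones.str.extract(r'^(\d{3})-', expand=False)

# Keep only the rows where the pattern matched
matched = area.dropna()

print(area)
print(matched)
```

Checking area.isna() first shows which inputs failed to match before any rows are discarded.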
Quick Reference
| Parameter | Description |
|---|---|
| pat | Regex pattern with capture groups to extract text |
| flags | Regex flags like re.IGNORECASE (default 0) |
| expand | If True (default), returns a DataFrame; if False, returns a Series for a single capture group and a DataFrame for multiple groups |
Key Takeaways
Use str.extract with regex capture groups to pull parts of strings in pandas columns.
Always include parentheses in your regex pattern to define what to extract.
The output is a DataFrame by default; with a single capture group, set expand=False to get a Series instead.
If the pattern does not match, the result will contain NaN values.
str.extract is great for splitting or cleaning text data in pandas.