
Extracting with str.extract (regex) in Data Analysis Python - Step-by-Step Execution

Concept Flow - Extracting with str.extract (regex)
Start with a pandas Series
Apply str.extract with regex
Regex matches groups in each string
Extract matched groups into new DataFrame columns
Result: DataFrame with extracted parts
End
We start with a pandas Series of strings, apply str.extract with a regex pattern that captures groups, and get a DataFrame with those extracted groups as columns.
Execution Sample
import pandas as pd
s = pd.Series(['abc123', 'def456', 'ghi789'])
result = s.str.extract(r'([a-z]+)(\d+)')
print(result)
This code extracts letters and digits from each string in the Series into separate columns.
Execution Table
Step | Input String | Regex Pattern | Match Groups | Extracted Output
1 | 'abc123' | r'([a-z]+)(\d+)' | Group 1: 'abc', Group 2: '123' | ['abc', '123']
2 | 'def456' | r'([a-z]+)(\d+)' | Group 1: 'def', Group 2: '456' | ['def', '456']
3 | 'ghi789' | r'([a-z]+)(\d+)' | Group 1: 'ghi', Group 2: '789' | ['ghi', '789']
4 | End of Series | N/A | N/A | DataFrame with columns 0 and 1 containing extracted groups
💡 All strings processed; extraction complete with groups captured into DataFrame columns.
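The row-by-row trace in the table can be reproduced by hand. The sketch below assumes that running `re.search` on each string is a fair stand-in for what `str.extract` does per element (it uses the first match only); the loop and the name `manual_rows` are illustrative and not part of pandas.

```python
import re
import pandas as pd

s = pd.Series(['abc123', 'def456', 'ghi789'])
pattern = r'([a-z]+)(\d+)'

# Trace each string by hand, one table row per step (illustrative loop).
manual_rows = []
for step, text in enumerate(s, start=1):
    match = re.search(pattern, text)
    groups = list(match.groups()) if match else [None, None]
    print(f"Step {step}: {text!r} -> {groups}")
    manual_rows.append(groups)

# The vectorized call does the same work in one go; columns are labeled 0 and 1.
result = s.str.extract(pattern)
print(result)
```

In practice only the single `s.str.extract(pattern)` call is needed; the loop is there to make the steps of the table visible.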
Variable Tracker
Variable | Start | After 1 | After 2 | After 3 | Final
s | Series(['abc123', 'def456', 'ghi789']) | Same | Same | Same | Same
result | None | DataFrame with first row ['abc', '123'] | DataFrame with first two rows ['abc', '123'], ['def', '456'] | DataFrame with first three rows ['abc', '123'], ['def', '456'], ['ghi', '789'] | DataFrame with all extracted groups
Key Moments - 2 Insights
Why does str.extract return a DataFrame instead of a Series?
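With the default expand=True, str.extract always returns a DataFrame with one column per capturing group. Here the regex has two groups, so the result has two columns, as shown in execution_table rows 1-3. A Series is returned only when the pattern has a single group and expand=False is passed.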
What happens if a string does not match the regex pattern?
If no match is found, str.extract returns NaN in every column for that row. The execution sample contains no such string, but a non-matching input would appear as a row of missing values in the output DataFrame.
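A minimal sketch of the non-matching case, assuming a hypothetical Series in which the middle string `'456'` has no letters and therefore cannot satisfy the pattern:

```python
import pandas as pd

# '456' is a made-up input with no leading letters, so the pattern cannot match it.
s = pd.Series(['abc123', '456', 'ghi789'])
result = s.str.extract(r'([a-z]+)(\d+)')
print(result)

# Rows where every extracted column is missing are the non-matching inputs.
unmatched = result.isna().all(axis=1)
print(s[unmatched])
```

Checking `isna()` on the result is one way to find inputs that need a different pattern or manual cleanup.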
Visual Quiz - 3 Questions
Test your understanding
Look at the execution_table at Step 2. What are the extracted groups for the input 'def456'?
A. Group 1: 'def', Group 2: '456'
B. Group 1: 'de', Group 2: 'f456'
C. Group 1: 'def4', Group 2: '56'
D. No match found
💡 Hint
Refer to execution_table row with Step 2 showing the matched groups.
At which step does the extraction process finish for all strings?
A. Step 3
B. Step 4
C. Step 1
D. Step 5
💡 Hint
Check the exit_note and execution_table last row describing completion.
If the regex pattern had only one group and str.extract were called with its default arguments, what would be the type of the output?
A. A DataFrame with one column
B. A list of strings
C. A Series with extracted strings
D. An error
💡 Hint
Recall that with the default expand=True, str.extract returns a DataFrame even for a single capturing group; a Series requires expand=False.
Concept Snapshot
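pandas.Series.str.extract(pat, flags=0, expand=True)
- Uses regex with capturing groups (at least one group is required)
- Extracts the groups of the first match in each string into DataFrame columns
- Returns a DataFrame by default (expand=True); returns a Series only with one group and expand=False
- Missing matches become NaN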
- Useful to split strings by patterns
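The return-type rule in the snapshot can be checked directly. This is a small sketch using the same sample Series; the variable names and the group names `letters` and `digits` are chosen here for illustration only.

```python
import pandas as pd

s = pd.Series(['abc123', 'def456', 'ghi789'])

# One capturing group, default expand=True: still a DataFrame (one column).
one_group_df = s.str.extract(r'(\d+)')
print(type(one_group_df).__name__, one_group_df.shape)

# One capturing group with expand=False: a Series.
one_group_series = s.str.extract(r'(\d+)', expand=False)
print(type(one_group_series).__name__)

# Named groups (illustrative names) become the column labels instead of 0 and 1.
named = s.str.extract(r'(?P<letters>[a-z]+)(?P<digits>\d+)')
print(list(named.columns))
```

Naming the groups tends to make the resulting DataFrame easier to work with than the default integer column labels.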
Full Transcript
We start with a pandas Series of strings. We apply the str.extract method with a regex pattern that has capturing groups. For each string, the regex matches parts and extracts groups. These groups become columns in a new DataFrame. If a string does not match, the output is NaN for that row. The process repeats for all strings in the Series. The final result is a DataFrame showing extracted parts side by side.