
Extracting with str.extract (regex) in Data Analysis Python

Introduction

We use str.extract to pull out specific parts of text from data using patterns. It helps us find and save useful pieces from messy text.

You want to get phone numbers from a list of customer messages.
You need to find dates hidden inside product reviews.
You want to separate area codes from full phone numbers in a contact list.
You need to extract email usernames from email addresses.
You want to pull out the first hashtag from each social media post.
Syntax
Data Analysis Python
Series.str.extract(pat, flags=0, expand=True)

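pat is the pattern you want to find, written as a regular expression (regex). It must contain at least one capture group.

flags takes flags from Python's re module, such as re.IGNORECASE. The default, 0, means no flags.

If expand=True, the result is always a DataFrame with one column per capture group. If expand=False, the result is a Series when the pattern has one capture group and a DataFrame when it has more than one.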
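
To make the effect of expand concrete, here is a minimal sketch. The phones Series is made-up sample data, not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
phones = pd.Series(['415-555-1234', '212-555-9876'])

# One capture group, expand=True (the default): a one-column DataFrame
as_frame = phones.str.extract(r'(\d{3})', expand=True)

# One capture group, expand=False: a Series
as_series = phones.str.extract(r'(\d{3})', expand=False)

# Two capture groups: a DataFrame even with expand=False
two_groups = phones.str.extract(r'(\d{3})-(\d{3})', expand=False)

print(type(as_frame).__name__)
print(type(as_series).__name__)
print(type(two_groups).__name__)
```

A Series is often handier when you want to assign the result straight to a new column.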

Examples
Extracts the first run of three digits (such as an area code) from phone numbers.
Data Analysis Python
df['phone'].str.extract(r'(\d{3})')
Extracts dates in YYYY-MM-DD format from text.
Data Analysis Python
df['text'].str.extract(r'(\d{4}-\d{2}-\d{2})')
Extracts the username part before '@' in email addresses.
Data Analysis Python
df['email'].str.extract(r'([^@]+)@')
Sample Program

This code creates a small table with messages. It then extracts area codes, dates, and email usernames using str.extract with regex patterns.

Data Analysis Python
import pandas as pd

data = {'messages': ['Call me at 415-555-1234', 'My birthday is 1990-05-21', 'Email: user@example.com']}
df = pd.DataFrame(data)

# Extract area code: the first 3 digits of a NNN-NNN-NNNN phone number
# (a bare \d{3} would also match '199' in the date 1990-05-21)
area_codes = df['messages'].str.extract(r'(\d{3})-\d{3}-\d{4}')

# Extract date in YYYY-MM-DD format
dates = df['messages'].str.extract(r'(\d{4}-\d{2}-\d{2})')

# Extract username from email
usernames = df['messages'].str.extract(r'([\w\.]+)@')

print('Area Codes:')
print(area_codes)
print('\nDates:')
print(dates)
print('\nUsernames:')
print(usernames)
Important Notes

Regex patterns use special symbols to match text. For example, \d means any digit.

str.extract keeps only the first match in each row. If no match is found, it returns NaN for that row.
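
A small sketch of how the NaN rows can be handled, assuming a made-up messages Series. Here notna() is used to keep only the rows where the pattern matched.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
messages = pd.Series(['Call me at 415-555-1234', 'No number here'])

# expand=False returns a Series because there is one capture group
area_codes = messages.str.extract(r'(\d{3})-\d{3}-\d{4}', expand=False)

# Rows with no match hold NaN; notna() flags the rows that did match
matched = area_codes.notna()
print(area_codes[matched].tolist())
```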

Use parentheses () in regex to capture the part you want to extract.
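
The sketch below shows two things about capture groups, using a made-up emails Series. Named groups, written (?P<name>...), become the column names of the result, and a pattern without any capture group raises a ValueError.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
emails = pd.Series(['user@example.com', 'jane.doe@mail.org'])

# Named groups (?P<name>...) become the column names of the result
parts = emails.str.extract(r'(?P<username>[^@]+)@(?P<domain>.+)')
print(list(parts.columns))

# A pattern with no capture group is rejected
try:
    emails.str.extract(r'[^@]+@')
    raised = False
except ValueError:
    raised = True
print(raised)
```

Naming the groups saves a separate renaming step when the pattern captures more than one piece.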

Summary

str.extract helps pull out parts of text using patterns.

It returns a new table with the extracted pieces or NaN if nothing matches.

Useful for cleaning and organizing messy text data.