How to Use str.get_dummies in pandas for One-Hot Encoding
Use str.get_dummies() on a pandas Series of strings to turn each unique value into its own one-hot encoded column. The method splits each string on a delimiter and creates binary columns indicating whether each category is present.
Syntax
The basic syntax of str.get_dummies() is:
Series.str.get_dummies(sep='delimiter')
Where:
- Series is a pandas Series with string values.
- sep is the delimiter string used to split each string into parts (default is '|').
This method returns a DataFrame with one column per unique split value and 1/0 indicating presence.
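As a quick check of the default separator and the return type, here is a minimal sketch. The Series values are made up for illustration and already use the pipe character, so sep can be left out.

```python
import pandas as pd

# Illustrative data: values already use the default '|' delimiter
s = pd.Series(['a|b', 'b', 'a|c'])

# No sep argument is needed because '|' is the default
dummies = s.str.get_dummies()

print(type(dummies).__name__)   # DataFrame
print(list(dummies.columns))    # ['a', 'b', 'c']
print(dummies.values.tolist())  # [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
```

Each row of the result lines up with the original Series index, so row 0 corresponds to 'a|b', row 1 to 'b', and row 2 to 'a|c'.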
Example
This example shows how to use str.get_dummies() to convert a Series of strings with comma-separated values into one-hot encoded columns.
python
import pandas as pd

# Sample data: Series with comma-separated categories
s = pd.Series(['apple,banana', 'banana', 'apple,orange', 'banana,orange,apple'])

# Use str.get_dummies with comma as separator
one_hot = s.str.get_dummies(sep=',')
print(one_hot)
Output
   apple  banana  orange
0      1       1       0
1      0       1       0
2      1       0       1
3      1       1       1
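Because the result is an ordinary DataFrame of 0/1 integers, it can be summarized with regular pandas operations. The sketch below reuses the same sample Series and shows just one possible follow-up: counting how often each category occurs and how many categories each row carries.

```python
import pandas as pd

s = pd.Series(['apple,banana', 'banana', 'apple,orange', 'banana,orange,apple'])
one_hot = s.str.get_dummies(sep=',')

# Column sums: number of rows that contain each category
category_counts = one_hot.sum()

# Row sums: number of categories present in each row
labels_per_row = one_hot.sum(axis=1)

print(list(category_counts.index))  # ['apple', 'banana', 'orange']
print(category_counts.tolist())     # [3, 3, 2]
print(labels_per_row.tolist())      # [2, 1, 2, 3]
```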
Common Pitfalls
Common mistakes when using str.get_dummies() include:
- Not specifying the correct sep delimiter, which leads to incorrect splitting.
- Applying str.get_dummies() to non-string data (a numeric Series has no .str accessor and raises an AttributeError) or to a Series with missing values (which produce all-zero rows) without cleaning first.
- Expecting it to work on DataFrames directly instead of Series.
Always ensure your data is a string Series and the separator matches your data format.
python
import pandas as pd

# Wrong: no separator specified for comma-separated data
s = pd.Series(['cat,dog', 'dog', 'cat,bird'])
wrong = s.str.get_dummies()  # Default sep='|', so no split

# Right: specify sep=','
right = s.str.get_dummies(sep=',')

print('Wrong output:\n', wrong)
print('\nRight output:\n', right)
Output
Wrong output:
    cat,bird  cat,dog  dog
0         0        1    0
1         0        0    1
2         1        0    0

Right output:
    bird  cat  dog
0     0    1    1
1     0    0    1
2     1    1    0
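The comparison above covers the separator pitfall. For the second pitfall, the following minimal sketch with made-up data shows what happens with missing values and non-string data: a missing entry does not raise an error but yields a row of zeros, while a numeric Series has no .str accessor and has to be converted first. Casting with astype(str) is one possible cleanup, not the only one, and here it is applied only to data without missing values.

```python
import numpy as np
import pandas as pd

# Missing value: no error, but row 1 is all zeros
with_gap = pd.Series(['cat,dog', np.nan, 'dog'])
gap_dummies = with_gap.str.get_dummies(sep=',')
print(gap_dummies.values.tolist())  # [[1, 1], [0, 0], [0, 1]]

# Numeric Series: the .str accessor is unavailable
codes = pd.Series([10, 20, 10])
try:
    codes.str.get_dummies()
    accessor_failed = False
except AttributeError as exc:
    accessor_failed = True
    print('AttributeError:', exc)

# One possible fix (an assumption about the desired cleanup): cast to strings
code_dummies = codes.astype(str).str.get_dummies()
print(list(code_dummies.columns))    # ['10', '20']
print(code_dummies.values.tolist())  # [[1, 0], [0, 1], [1, 0]]
```

Whether an all-zero row is acceptable for missing data depends on the analysis; dropping or filling those entries beforehand makes the choice explicit.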
Quick Reference
Summary tips for using str.get_dummies():
- Use on a pandas Series with string values.
- Set sep to the delimiter used in your strings (e.g., ',', '|', ' ').
- Returns a DataFrame with one-hot encoded columns for each unique split value.
- Useful for converting multi-label text data into numeric format for analysis.
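Since the method works on a Series rather than a DataFrame, one common pattern is to call it on a single column and join the result back. The sketch below assumes a hypothetical DataFrame with title and tags columns; the names and data are invented for illustration.

```python
import pandas as pd

# Hypothetical data: one delimiter-separated label column
df = pd.DataFrame({
    'title': ['post A', 'post B', 'post C'],
    'tags': ['python,pandas', 'pandas', 'python,numpy'],
})

# Select the column (a Series), encode it, then join on the shared index
tag_dummies = df['tags'].str.get_dummies(sep=',')
encoded = df.drop(columns='tags').join(tag_dummies)

print(list(encoded.columns))  # ['title', 'numpy', 'pandas', 'python']
print(encoded[['numpy', 'pandas', 'python']].values.tolist())
# [[0, 1, 1], [0, 1, 0], [1, 0, 1]]
```

The join relies on tag_dummies keeping the same index as df, which holds because str.get_dummies() preserves the index of the Series it is called on.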
Key Takeaways
Use str.get_dummies on a pandas Series of strings to create one-hot encoded columns.
Always specify the correct separator with the sep parameter to split strings properly.
The method returns a DataFrame with binary columns for each unique category found.
Ensure your data is clean and string-typed before applying str.get_dummies.
It is ideal for multi-label categorical data stored as delimiter-separated strings.