How to Standardize a Column in a Pandas DataFrame
To standardize a column in pandas, subtract the column's mean and divide by its standard deviation using df['column'] = (df['column'] - df['column'].mean()) / df['column'].std(). This scales the data to have a mean of 0 and a standard deviation of 1.
Syntax
The basic syntax to standardize a column in pandas is:
python
df['column'] = (df['column'] - df['column'].mean()) / df['column'].std()
- df['column']: Selects the column to standardize.
- df['column'].mean(): Calculates the mean of the column.
- df['column'].std(): Calculates the standard deviation of the column.
- Subtract the mean and divide by the standard deviation to scale the data.
Example
This example shows how to standardize the 'Age' column in a pandas DataFrame.
python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40]}
df = pd.DataFrame(data)

# Standardize the 'Age' column
df['Age_standardized'] = (df['Age'] - df['Age'].mean()) / df['Age'].std()

print(df)
Output
      Name  Age  Age_standardized
0    Alice   25         -1.161895
1      Bob   30         -0.387298
2  Charlie   35          0.387298
3    David   40          1.161895
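As a quick sanity check, a standardized column should have a mean of approximately 0 and a sample standard deviation of approximately 1. The following is a minimal sketch that reuses the example data; the variable name z is only illustrative, and the comparisons use a small tolerance because floating-point results are rarely exactly 0 or 1.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40]})

# Standardize with the default sample standard deviation (ddof=1)
z = (df['Age'] - df['Age'].mean()) / df['Age'].std()

# Check that the mean is close to 0 and the sample std is close to 1
mean_ok = abs(z.mean()) < 1e-9
std_ok = abs(z.std() - 1) < 1e-9
print(mean_ok, std_ok)
```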
Common Pitfalls
Common mistakes when standardizing columns include:
- Forgetting to assign the standardized values back to the DataFrame.
- Standardizing columns with non-numeric data, which causes errors.
- Using std() without specifying ddof=0 if population standard deviation is needed (default is sample std).
Always check the data type before standardizing and assign the result to a new or existing column.
python
import pandas as pd

data = {'Name': ['Alice', 'Bob'], 'Age': ['twenty-five', 'thirty']}
df = pd.DataFrame(data)

# Wrong: trying to standardize non-numeric data
# df['Age_standardized'] = (df['Age'] - df['Age'].mean()) / df['Age'].std()
# This will raise an error

# Right: convert to numeric first or select numeric columns only
# Example with numeric data:
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
df['Age_standardized'] = (df['Age'] - df['Age'].mean()) / df['Age'].std()
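When a numeric column has been read in as strings, one possible way to "convert to numeric first" is pd.to_numeric with errors='coerce'. The sketch below assumes hypothetical data in which ages are stored as digit strings with one unparseable entry. Note that spelled-out values such as 'twenty-five' cannot be parsed this way and would also become NaN.

```python
import pandas as pd

# Hypothetical data: ages stored as strings, one value not parseable
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': ['25', '30', 'unknown']})

# Convert to numeric; values that cannot be parsed become NaN
age = pd.to_numeric(df['Age'], errors='coerce')

# mean() and std() skip NaN by default, so the unparseable row stays NaN
df['Age_standardized'] = (age - age.mean()) / age.std()

print(df)
```

Whether to drop, impute, or keep the resulting NaN rows depends on the analysis; this sketch simply leaves them in place.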
Quick Reference
Tips for standardizing columns in pandas:
- Use df['col'] = (df['col'] - df['col'].mean()) / df['col'].std() to standardize.
- Ensure the column is numeric before standardizing.
- Assign the result to a new column to keep original data.
- Use ddof=0 in std() for population standard deviation if needed.
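To make the effect of ddof concrete, the sketch below standardizes the same ages once with the pandas default (sample standard deviation, ddof=1) and once with ddof=0 (population standard deviation). The names z_sample and z_population are illustrative only.

```python
import pandas as pd

age = pd.Series([25, 30, 35, 40], name='Age')

# Sample standard deviation (pandas default, ddof=1)
z_sample = (age - age.mean()) / age.std()

# Population standard deviation (ddof=0)
z_population = (age - age.mean()) / age.std(ddof=0)

print(pd.DataFrame({'sample': z_sample, 'population': z_population}))
```

With this data, the first value is about -1.16 using the sample standard deviation and about -1.34 using the population standard deviation. The choice matters when comparing results with other tools: NumPy's np.std, for example, defaults to ddof=0.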
Key Takeaways
Standardize a pandas column by subtracting its mean and dividing by its standard deviation.
Always ensure the column data is numeric before standardizing to avoid errors.
Assign the standardized values to a new column to preserve the original data, or back to the same column to overwrite it.
Use the default sample standard deviation or specify ddof=0 for population standard deviation.
Standardization scales data to mean 0 and standard deviation 1, useful for many analyses.