How to Use describe() in pandas for Data Summary
Use DataFrame.describe() in pandas to get a quick summary of statistics like count, mean, standard deviation, min, max, and quartiles for numeric columns by default. You can also include categorical data by passing include='all' to see counts and unique values.
Syntax
The describe() function is called on a pandas DataFrame or Series to generate descriptive statistics.
percentiles: List of percentiles to include, given as fractions between 0 and 1 (the default covers 25%, 50%, and 75%).
include: Data types to include (e.g., 'all', 'object', 'number').
exclude: Data types to exclude.
datetime_is_numeric: Treat datetime columns as numeric if True. This parameter exists only in pandas 1.x; it was removed in pandas 2.0, where datetime columns are always summarized as numeric.
```python
# pandas 2.x signature
df.describe(percentiles=None, include=None, exclude=None)

# pandas 1.x also accepted datetime_is_numeric=False
```
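Since describe() also works on a Series, a minimal sketch of that case may help. The `scores` and `names` Series below are hypothetical sample data, not part of the article's example; a numeric Series should produce the numeric statistics, while a Series of strings should produce count, unique, top, and freq.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
scores = pd.Series([85, 90, 95], name="score")
names = pd.Series(["Alice", "Bob", "Alice"], name="name")

# Numeric Series: count, mean, std, min, quartiles, max
numeric_stats = scores.describe()

# Series of strings: count, unique, top (most frequent value), freq
text_stats = names.describe()

print(numeric_stats)
print(text_stats)
```

The result is itself a Series, so individual statistics can be read by label, for example `numeric_stats['mean']`.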
Example
This example shows how to use describe() on a DataFrame with numeric and categorical data. It demonstrates the default numeric summary and how to include all columns.
```python
import pandas as pd

data = {
    'age': [25, 30, 22, 40, 28],
    'salary': [50000, 60000, 45000, 80000, 52000],
    'department': ['HR', 'Engineering', 'HR', 'Management', 'Engineering']
}
df = pd.DataFrame(data)

# Default describe (numeric columns only)
numeric_summary = df.describe()

# Describe including all columns
full_summary = df.describe(include='all')

print('Numeric Summary:')
print(numeric_summary)
print('\nFull Summary:')
print(full_summary)
```
Output
```
Numeric Summary:
             age        salary
count   5.000000      5.000000
mean   29.000000  57400.000000
std     6.855655  13740.451230
min    22.000000  45000.000000
25%    25.000000  50000.000000
50%    28.000000  52000.000000
75%    30.000000  60000.000000
max    40.000000  80000.000000

Full Summary:
              age        salary department
count    5.000000      5.000000          5
unique        NaN           NaN          3
top           NaN           NaN         HR
freq          NaN           NaN          2
mean    29.000000  57400.000000        NaN
std      6.855655  13740.451230        NaN
min     22.000000  45000.000000        NaN
25%     25.000000  50000.000000        NaN
50%     28.000000  52000.000000        NaN
75%     30.000000  60000.000000        NaN
max     40.000000  80000.000000        NaN
```
Common Pitfalls
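One common mistake is expecting describe() to summarize non-numeric columns by default. On a DataFrame with mixed types it summarizes only the numeric columns unless you pass include='all' or a specific list of dtypes; if the DataFrame has no numeric columns at all, the object and categorical columns are summarized instead. Another pitfall is passing percentiles as whole numbers: the values must be fractions between 0 and 1 (0.9, not 90), and anything outside that range raises a ValueError.
Also, describe() does not return an empty summary for missing data. A numeric column that is entirely NaN still gets the full set of rows, with a count of 0 and NaN for every other statistic, while a DataFrame with no columns raises a ValueError.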
```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie'], 'score': [85, 90, 95]}
df = pd.DataFrame(data)

# Wrong: expecting 'name' summary by default
print(df.describe())  # Only 'score' summarized

# Right: include all columns
print(df.describe(include='all'))
```
Output
```
       score
count    3.0
mean    90.0
std      5.0
min     85.0
25%     87.5
50%     90.0
75%     92.5
max     95.0
         name  score
count       3    3.0
unique      3    NaN
top     Alice    NaN
freq        1    NaN
mean      NaN   90.0
std       NaN    5.0
min       NaN   85.0
25%       NaN   87.5
50%       NaN   90.0
75%       NaN   92.5
max       NaN   95.0
```
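The other two pitfalls, percentile values and missing data, can be checked the same way. The following is a minimal sketch using hypothetical data; the variable names (`custom`, `nan_summary`, and the two boolean flags) are illustrative only, and the exact error messages may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, used only for illustration.
df = pd.DataFrame({"score": [60, 70, 85, 90, 95]})

# Percentiles are fractions in [0, 1]: 0.1 and 0.9, not 10 and 90.
custom = df.describe(percentiles=[0.1, 0.9])
print(custom)

# Whole-number percentiles are rejected with a ValueError.
try:
    df.describe(percentiles=[10, 90])
    bad_percentiles_rejected = False
except ValueError as err:
    bad_percentiles_rejected = True
    print("ValueError:", err)

# An all-NaN numeric column is still described: count is 0, the rest NaN.
all_nan = pd.DataFrame({"score": [np.nan, np.nan, np.nan]})
nan_summary = all_nan.describe()
print(nan_summary)

# A DataFrame with no columns cannot be described at all.
try:
    pd.DataFrame().describe()
    empty_rejected = False
except ValueError as err:
    empty_rejected = True
    print("ValueError:", err)
```

Checking the count row first is a simple way to spot columns that have no usable values before reading the other statistics.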
Quick Reference
Here is a quick summary of key describe() options:
| Parameter | Description | Default |
|---|---|---|
| percentiles | List of percentiles to include (0 to 1) | [0.25, 0.5, 0.75] |
| include | Data types to include (e.g., 'all', 'number', 'object') | None (numeric only) |
| exclude | Data types to exclude | None |
| datetime_is_numeric | Treat datetime columns as numeric if True (pandas 1.x only; removed in pandas 2.0) | False |
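To illustrate the include and exclude options from the table, here is a short sketch that selects columns by dtype. The DataFrame is hypothetical sample data, and `np.number` is used as the dtype selector because it matches every numeric column regardless of integer or float type.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "age": [25, 30, 22],
    "salary": [50000.0, 60000.0, 45000.0],
    "department": ["HR", "Engineering", "HR"],
})

# Keep only numeric columns (same columns as the default for this frame).
numeric_only = df.describe(include=[np.number])

# Drop numeric columns, leaving the summary of 'department'.
non_numeric = df.describe(exclude=[np.number])

print(numeric_only)
print(non_numeric)
```

Using exclude here avoids naming the string dtype explicitly, which may be useful if the column's dtype varies between pandas versions.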
Key Takeaways
Use df.describe() to quickly get summary statistics of numeric columns in a DataFrame.
Pass include='all' to describe() to get summaries of all columns including categorical data.
Customize percentiles with the percentiles parameter using values between 0 and 1.
describe() returns count, mean, std, min, max, and quartiles by default for numeric data.
All-NaN numeric columns report a count of 0 and NaN for the other statistics, and a DataFrame with no columns raises a ValueError when describe() is called.