
How to Use describe() in pandas for Data Summary

DataFrame.describe() in pandas returns a quick statistical summary of a DataFrame. By default it covers only numeric columns and reports count, mean, standard deviation, min, max, and quartiles. Pass include='all' to summarize non-numeric columns as well, which report count, unique, top (the most frequent value), and freq.

📐

Syntax

The describe() function is called on a pandas DataFrame or Series to generate descriptive statistics.

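DataFrame.describe(percentiles=None, include=None, exclude=None)
percentiles: List of percentiles to include, each given as a fraction between 0 and 1 (default is [0.25, 0.5, 0.75]).
include: Data types to include (e.g., 'all', 'object', 'number'). The default, None, summarizes numeric columns only.
exclude: Data types to exclude.
datetime_is_numeric: Legacy parameter (default False) available in pandas 1.1 through 1.5. It was removed in pandas 2.0, where datetime columns are always summarized as numeric data, so passing it there raises a TypeError.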

python

# pandas 1.1 to 1.5 also accept datetime_is_numeric=False; it was removed in pandas 2.0
df.describe(percentiles=None, include=None, exclude=None)
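To make the parameters above concrete, here is a minimal sketch that passes custom percentiles and include='all'. The DataFrame and its column names (height_cm, city) are made up for illustration and are not part of the pandas API.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    'height_cm': [150, 160, 170, 180, 190],
    'city': ['Oslo', 'Lima', 'Oslo', 'Pune', 'Lima'],
})

# Percentiles are fractions between 0 and 1, so 0.1 means the 10th percentile.
tails = df.describe(percentiles=[0.1, 0.5, 0.9])
print(tails.index.tolist())
# ['count', 'mean', 'std', 'min', '10%', '50%', '90%', 'max']

# include='all' adds the non-numeric column to the summary.
everything = df.describe(include='all')
print(everything.columns.tolist())
# ['height_cm', 'city']
```

Only the numeric column appears in tails, because percentiles does not change which columns are selected.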

💻

Example

This example shows how to use describe() on a DataFrame with numeric and categorical data. It demonstrates the default numeric summary and how to include all columns.

python

import pandas as pd

data = {
    'age': [25, 30, 22, 40, 28],
    'salary': [50000, 60000, 45000, 80000, 52000],
    'department': ['HR', 'Engineering', 'HR', 'Management', 'Engineering']
}
df = pd.DataFrame(data)

# Default describe (numeric columns only)
numeric_summary = df.describe()

# Describe including all columns
full_summary = df.describe(include='all')

print('Numeric Summary:')
print(numeric_summary)
print('\nFull Summary:')
print(full_summary)

Output

Numeric Summary:
             age        salary
count   5.000000      5.000000
mean   29.000000  57400.000000
std     6.855655  13740.451230
min    22.000000  45000.000000
25%    25.000000  50000.000000
50%    28.000000  52000.000000
75%    30.000000  60000.000000
max    40.000000  80000.000000

Full Summary:
              age        salary department
count    5.000000      5.000000          5
unique        NaN           NaN          3
top           NaN           NaN         HR
freq          NaN           NaN          2
mean    29.000000  57400.000000        NaN
std      6.855655  13740.451230        NaN
min     22.000000  45000.000000        NaN
25%     25.000000  50000.000000        NaN
50%     28.000000  52000.000000        NaN
75%     30.000000  60000.000000        NaN
max     40.000000  80000.000000        NaN
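If you want only the non-numeric columns, or the statistics for a single column, the same example data can be summarized as sketched below. This is one possible approach: it uses exclude=[np.number] rather than include=['object'], on the assumption that this keeps working if your pandas version stores strings in a dedicated string dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, 30, 22, 40, 28],
    'salary': [50000, 60000, 45000, 80000, 52000],
    'department': ['HR', 'Engineering', 'HR', 'Management', 'Engineering'],
})

# Drop every numeric column, leaving only 'department' in the summary.
text_summary = df.describe(exclude=[np.number])
print(text_summary)

# A single column is a Series, so describe() returns a Series of statistics.
age_stats = df['age'].describe()
print(age_stats['mean'])  # 29.0
```

Because describe() returns an ordinary DataFrame or Series, individual statistics can be picked out with normal indexing, as age_stats['mean'] does here.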

⚠️

Common Pitfalls

One common mistake is expecting describe() to summarize non-numeric columns by default. On a DataFrame with mixed types, it summarizes only the numeric columns unless you pass include='all' or name the types you want in include. Another pitfall is passing percentiles as whole numbers: each value must be a fraction between 0 and 1, so the 90th percentile is 0.9, not 90.

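Also, describe() does not return an empty summary for degenerate input. A DataFrame with no columns raises a ValueError, and a numeric column that contains only missing values reports a count of 0 with NaN for every other statistic.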

python

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie'], 'score': [85, 90, 95]}
df = pd.DataFrame(data)

# Wrong: expecting 'name' summary by default
print(df.describe())  # Only 'score' summarized

# Right: include all columns
print(df.describe(include='all'))

Output

       score
count    3.0
mean    90.0
std      5.0
min     85.0
25%     87.5
50%     90.0
75%     92.5
max     95.0
         name  score
count       3    3.0
unique      3    NaN
top     Alice    NaN
freq        1    NaN
mean      NaN   90.0
std       NaN    5.0
min       NaN   85.0
25%       NaN   87.5
50%       NaN   90.0
75%       NaN   92.5
max       NaN   95.0
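The percentile and edge-case pitfalls can be checked in the same way. The following is a small sketch rather than a complete test: the variable names are illustrative, and exact error messages may differ between pandas versions, so it only relies on the exception type.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [85, 90, 95]})

# Percentiles outside the 0-1 range are rejected (25 should be 0.25).
try:
    df.describe(percentiles=[25])
    bad_percentile_error = None
except ValueError as exc:
    bad_percentile_error = exc
print('percentiles=[25] raised:', type(bad_percentile_error).__name__)

# A DataFrame with no columns cannot be described at all.
try:
    pd.DataFrame().describe()
    empty_error = None
except ValueError as exc:
    empty_error = exc
print('empty DataFrame raised:', type(empty_error).__name__)

# An all-NaN numeric column still gets a summary: count is 0, the rest is NaN.
nan_summary = pd.DataFrame({'x': [np.nan, np.nan]}).describe()
print(nan_summary)
```

Checking count first is a simple way to spot columns that hold no usable values before reading the other statistics.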

📊

Quick Reference

Here is a quick summary of key describe() options:

Parameter	Description	Default
percentiles	List of percentiles to include (0 to 1)	[0.25, 0.5, 0.75]
include	Data types to include (e.g., 'all', 'number', 'object')	None (numeric only)
exclude	Data types to exclude	None
datetime_is_numeric	Treat datetime columns as numeric if True (pandas 1.1 to 1.5 only; removed in pandas 2.0)	False

✅

Key Takeaways

Use df.describe() to quickly get summary statistics of numeric columns in a DataFrame.

Pass include='all' to describe() to get summaries of all columns including categorical data.

Customize percentiles with the percentiles parameter using values between 0 and 1.

describe() returns count, mean, std, min, max, and quartiles by default for numeric data.

A DataFrame with no columns raises a ValueError in describe(), while all-NaN numeric columns report a count of 0 and NaN for the other statistics.