
How to Use describe() in pandas for Data Summary

Use DataFrame.describe() in pandas to get a quick statistical summary of a DataFrame. By default it reports count, mean, standard deviation, min, max, and quartiles for the numeric columns. Pass include='all' to also summarize categorical and text columns, which adds the count, the number of unique values, the most frequent value, and its frequency.

Syntax

The describe() method is called on a pandas DataFrame or Series and returns the descriptive statistics as a new DataFrame or Series.

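  • DataFrame.describe(percentiles=None, include=None, exclude=None)
  • percentiles: List of percentiles to report, given as fractions between 0 and 1 (the default reports 25%, 50%, and 75%).
  • include: Data types to include (e.g., 'all', 'object', 'number'). For a DataFrame with mixed types, the default (None) summarizes only the numeric columns.
  • exclude: Data types to exclude.
  • datetime_is_numeric: Available in pandas 1.1 through 1.5 only; if True, datetime columns are treated as numeric. It was removed in pandas 2.0, where datetime columns are always summarized numerically, so passing it there raises a TypeError.
python
df.describe(percentiles=None, include=None, exclude=None)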
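To make the percentiles parameter concrete, here is a minimal sketch. The column name and values are hypothetical, chosen only for illustration; the percentiles are passed explicitly as fractions.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the parameter.
df = pd.DataFrame({"height_cm": [150.0, 160.0, 170.0, 180.0, 190.0]})

# Ask for the 10th, 50th, and 90th percentiles instead of the quartiles.
tails = df.describe(percentiles=[0.1, 0.5, 0.9])
print(tails)

# The row labels are built from the fractions: '10%', '50%', '90%'.
print(list(tails.index))
```

The percentile rows replace the default 25% and 75% rows; the other statistics are unchanged.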

Example

This example shows how to use describe() on a DataFrame with numeric and categorical data. It demonstrates the default numeric summary and how to include all columns.

python
import pandas as pd

data = {
    'age': [25, 30, 22, 40, 28],
    'salary': [50000, 60000, 45000, 80000, 52000],
    'department': ['HR', 'Engineering', 'HR', 'Management', 'Engineering']
}
df = pd.DataFrame(data)

# Default describe (numeric columns only)
numeric_summary = df.describe()

# Describe including all columns
full_summary = df.describe(include='all')

print('Numeric Summary:')
print(numeric_summary)
print('\nFull Summary:')
print(full_summary)
Output
Numeric Summary:
             age        salary
count   5.000000      5.000000
mean   29.000000  57400.000000
std     6.855655  13740.451230
min    22.000000  45000.000000
25%    25.000000  50000.000000
50%    28.000000  52000.000000
75%    30.000000  60000.000000
max    40.000000  80000.000000

Full Summary:
              age        salary department
count    5.000000      5.000000          5
unique        NaN           NaN          3
top           NaN           NaN         HR
freq          NaN           NaN          2
mean    29.000000  57400.000000        NaN
std      6.855655  13740.451230        NaN
min     22.000000  45000.000000        NaN
25%     25.000000  50000.000000        NaN
50%     28.000000  52000.000000        NaN
75%     30.000000  60000.000000        NaN
max     40.000000  80000.000000        NaN
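Because the result of describe() is itself a DataFrame, individual statistics can be read back with ordinary indexing. The sketch below reuses the numeric columns from the example; the variable names (summary, mean_age, median_salary) are illustrative, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 22, 40, 28],
    "salary": [50000, 60000, 45000, 80000, 52000],
})

summary = df.describe()

# Rows are statistics, columns are the original column names.
mean_age = summary.loc["mean", "age"]
median_salary = summary.loc["50%", "salary"]
print(mean_age, median_salary)

# Transposing gives one row per column, which can be easier to scan.
print(summary.T[["mean", "min", "max"]])

# A single column is a Series, and Series.describe() works the same way.
print(df["age"].describe())
```

Transposing is a matter of taste; it tends to help when a DataFrame has many columns and only a few statistics matter.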

Common Pitfalls

A common mistake is expecting describe() to summarize non-numeric columns by default. On a DataFrame with mixed types it summarizes only the numeric columns unless you pass include='all'. Another pitfall is passing percentiles as whole numbers: they must be fractions between 0 and 1, so the 90th percentile is written 0.9, not 90.
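The percentile range is easy to verify. This is a small sketch with made-up scores; the exact wording of the error message may differ between pandas versions, so it is only printed here, not relied on.

```python
import pandas as pd

df = pd.DataFrame({"score": [85, 90, 95]})

# Percentiles are fractions, so the 90th percentile is written 0.9.
ok = df.describe(percentiles=[0.5, 0.9])
print(ok)

# Passing 90 instead of 0.9 is rejected with a ValueError.
try:
    df.describe(percentiles=[90])
    raised = False
except ValueError as err:
    raised = True
    print("Rejected:", err)
```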

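Also, watch the edge cases around missing data. A numeric column whose values are all missing still produces a summary, with a count of 0 and NaN for the remaining statistics, while calling describe() on a DataFrame that has no columns raises a ValueError.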

python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie'], 'score': [85, 90, 95]}
df = pd.DataFrame(data)

# Wrong: expecting 'name' summary by default
print(df.describe())  # Only 'score' summarized

# Right: include all columns
print(df.describe(include='all'))
Output
       score
count    3.0
mean    90.0
std      5.0
min     85.0
25%     87.5
50%     90.0
75%     92.5
max     95.0
         name  score
count       3    3.0
unique      3    NaN
top     Alice    NaN
freq        1    NaN
mean      NaN   90.0
std       NaN    5.0
min       NaN   85.0
25%       NaN   87.5
50%       NaN   90.0
75%       NaN   92.5
max       NaN   95.0
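The behavior on empty or all-missing data is worth checking directly, since it is easy to misremember. The following is a minimal sketch, assuming a reasonably recent pandas release; the column name x is hypothetical.

```python
import pandas as pd

# A DataFrame with no columns cannot be described.
try:
    pd.DataFrame().describe()
    raised = False
except ValueError as err:
    raised = True
    print("Error:", err)

# A numeric column that is entirely missing still yields a summary:
# count is 0 and the remaining statistics are NaN.
all_missing = pd.DataFrame({"x": [float("nan")] * 3})
summary = all_missing.describe()
print(summary)
```

Checking the count row first is a simple way to spot columns that contain no usable values.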

Quick Reference

Here is a quick summary of key describe() options:

Parameter            | Description                                                | Default
percentiles          | List of percentiles to include (fractions from 0 to 1)     | [0.25, 0.5, 0.75]
include              | Data types to include (e.g., 'all', 'number', 'object')    | None (numeric only)
exclude              | Data types to exclude                                      | None
datetime_is_numeric  | Treat datetime columns as numeric if True (pandas 1.1 to 1.5 only; removed in 2.0) | False
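The include and exclude options from the table can be compared side by side. This is a sketch on hypothetical mixed-type data; the column names are made up for illustration.

```python
import pandas as pd

# Hypothetical mixed-type data for illustration.
df = pd.DataFrame({
    "price": [9.5, 12.0, 7.25],
    "units": [3, 1, 4],
    "region": ["north", "south", "north"],
})

numeric_only = df.describe(include=["number"])  # numeric columns only
non_numeric = df.describe(exclude=["number"])   # everything that is not numeric
everything = df.describe(include="all")         # all three columns

print(list(numeric_only.columns))
print(list(non_numeric.columns))
print(list(everything.columns))
```

Note that include='all' is passed as a plain string, while specific data types go in a list.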

Key Takeaways

  • Use df.describe() to quickly get summary statistics for the numeric columns of a DataFrame.
  • Pass include='all' to describe() to summarize all columns, including categorical data.
  • Customize percentiles with the percentiles parameter, using fractions between 0 and 1.
  • For numeric data, describe() returns count, mean, std, min, max, and quartiles by default.
  • An all-NaN numeric column yields a count of 0 and NaN statistics; a DataFrame with no columns raises a ValueError.