Overview - describe() for statistical summary

What is it?

The describe() method in pandas returns a quick statistical summary of a DataFrame or Series. For each numeric column it reports the count, mean, standard deviation, minimum, maximum, and selected percentiles. This helps you understand the data's shape and spread without looking at every value. Text and categorical columns get a different summary: count, number of unique values, most frequent value, and its frequency.

Why it matters

Without describe(), you would have to compute each statistic yourself, which is slow and error-prone. Describe() saves time and helps you spot problems such as missing data or implausible values early. This makes analysis faster and more reliable, and supports better data-driven decisions.

Where it fits

Before using describe(), you should know how to load data into pandas and understand basic tables (DataFrames). After describe(), you can learn deeper data cleaning, visualization, and statistical testing to explore data further.

Mental Model

Core Idea

Describe() quickly summarizes key statistics of each column to give a snapshot of the data's distribution and completeness.

Think of it like...

It's like a health check-up report for your data, showing vital signs like average, spread, and counts so you know if your data is healthy or needs attention.

┌───────────────┬───────────┬───────────┬───────────┬───────────┐
│ Statistic     │ Column A  │ Column B  │ Column C  │ ...       │
├───────────────┼───────────┼───────────┼───────────┼───────────┤
│ count         │ 100       │ 100       │ 100       │           │
│ mean          │ 50.5      │ 20.1      │ NaN       │           │
│ std           │ 10.2      │ 5.3       │ NaN       │           │
│ min           │ 30        │ 10        │ NaN       │           │
│ 25%           │ 45        │ 15        │ NaN       │           │
│ 50% (median)  │ 50        │ 20        │ NaN       │           │
│ 75%           │ 55        │ 25        │ NaN       │           │
│ max           │ 70        │ 30        │ NaN       │           │
└───────────────┴───────────┴───────────┴───────────┴───────────┘
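A table like the one above, where one column has a count but NaN for every numeric statistic, is what you typically see when a text column is summarized together with numeric ones. A minimal sketch, assuming a small made-up DataFrame (the column names A, B, C and all values are illustrative, not taken from a real dataset):

```python
import pandas as pd

# Hypothetical data: two numeric columns and one text column.
df = pd.DataFrame({
    "A": [30, 45, 50, 55, 70],
    "B": [10, 15, 20, 25, 30],
    "C": ["x", "y", "x", "z", "x"],
})

# include="all" summarizes every column in one table.
# Statistics that do not apply to a column are filled with NaN.
summary_all = df.describe(include="all")
print(summary_all)
```

In this combined view the text column has NaN for mean, std, and the percentiles, while the numeric columns have NaN for unique, top, and freq.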

Build-Up - 6 Steps

1

Foundation: What describe() Does Simply

Concept: Introduce the basic purpose of describe() to get quick stats.

Load a simple table with numbers and call describe() on it. It returns count, mean, std, min, max, and quartiles for each column automatically.

Result

A table showing count, mean, std, min, 25%, 50%, 75%, and max for each numeric column.

Understanding that describe() gives a fast overview helps you quickly check data quality and distribution without extra code.
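As a rough illustration of this first step, the sketch below builds a tiny numeric table and calls describe() on it. The column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical measurements, used only to illustrate the output layout.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 60, 70, 80, 90],
})

# One call returns eight statistics per numeric column.
summary = df.describe()
print(summary)

# Individual values can be read with .loc[statistic, column].
print(summary.loc["mean", "height_cm"])
```

The result is itself a DataFrame, so it can be sliced, rounded, or transposed (summary.T) like any other table.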

2

Foundation: Using describe() on Different Data Types

3

Intermediate: Customizing describe() Output

4

Intermediate: Handling Missing Data in describe()

5

Advanced: Using describe() with MultiIndex and Complex Data

6

Expert: Performance and Internals of describe()

Under the Hood

Describe() works by applying a set of aggregations to each column of the DataFrame. For numeric data, it computes count, mean, sample standard deviation, min, max, and percentiles using pandas' vectorized reductions. For text and categorical data, it computes count, number of unique values, the most frequent value (top), and that value's frequency (freq). Missing values are skipped in every calculation, so count reports only the non-missing entries. The output adapts to the column's data type and to parameters such as percentiles, include, and exclude.
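To make the idea concrete, here is a sketch that rebuilds the numeric summary by hand from ordinary Series methods and compares it with describe(). This is not pandas' actual implementation, only an equivalent calculation on made-up data.

```python
import pandas as pd

# Hypothetical series with one missing value.
s = pd.Series([4.0, 8.0, None, 15.0, 16.0, 23.0, 42.0], name="value")

# Each statistic computed separately; all of these skip NaN by default.
manual = pd.Series({
    "count": s.count(),          # non-missing entries only
    "mean": s.mean(),
    "std": s.std(),              # sample standard deviation (ddof=1)
    "min": s.min(),
    "25%": s.quantile(0.25),
    "50%": s.quantile(0.50),
    "75%": s.quantile(0.75),
    "max": s.max(),
})

built_in = s.describe()

# The two summaries should agree up to floating-point noise.
print((manual - built_in).abs().max())
```

Writing the eight calls out like this shows what describe() saves you from repeating for every column.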

Why designed this way?

Describe() was designed to give a quick, general-purpose summary without needing users to write multiple commands. It balances detail and simplicity, providing the most useful stats by default. Alternatives like manual aggregation are slower and error-prone. The design also supports different data types and missing data gracefully, making it flexible for real-world messy data.

DataFrame Columns
   │
   ├─ Numeric Columns ──> Aggregations: count, mean, std, min, percentiles, max
   │
   ├─ Categorical Columns ──> Aggregations: count, unique, top, freq
   │
   ├─ Missing Values ──> Skipped in calculations, excluded from count
   │
   └─ Output: Summary Table with statistics per column

Myth Busters - 4 Common Misconceptions

Quick: Does describe() include missing values in its count statistic? Commit to yes or no.

Common Belief: Describe() counts all rows including missing values in the count statistic.

Reality: No. The count statistic includes only non-missing values, so it can be smaller than the number of rows.

Quick: Does describe() show the same statistics for text and numeric columns? Commit to yes or no.

Common Belief: Describe() always shows mean, std, and percentiles for all columns regardless of type.

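Reality: No. Numeric columns get mean, std, and percentiles, while text and categorical columns get count, unique, top, and freq.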

Quick: Can you customize which percentiles describe() shows? Commit to yes or no.

Common Belief: Describe() always shows fixed percentiles (25%, 50%, 75%) and cannot be changed.

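Reality: Yes. The 25%, 50%, and 75% values are only the defaults; the percentiles parameter accepts your own list of values between 0 and 1.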

Quick: Does describe() work the same on grouped data as on whole DataFrames? Commit to yes or no.

Common Belief: Describe() cannot be used on grouped data or multi-index DataFrames.

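Reality: Describe() is also available on groupby objects, where it returns one set of summary statistics per group instead of one for the whole table.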
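A minimal sketch of describe() on grouped data, assuming a made-up table with a hypothetical team column and a numeric score column:

```python
import pandas as pd

# Hypothetical scores for two teams.
df = pd.DataFrame({
    "team": ["a", "a", "a", "b", "b", "b"],
    "score": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Grouping first yields one row of statistics per group.
per_group = df.groupby("team")["score"].describe()
print(per_group)
```

Here the layout is flipped compared with a plain df.describe(): groups are rows and statistics are columns. Describing several columns per group at once produces a two-level column index (column name, statistic).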

Expert Zone

1

Describe() builds on pandas' NaN-skipping reductions (count, mean, std, quantile, and so on), so missing data is handled without extra user effort.

2

When used on categorical data with many unique values, describe() can be slow because it counts unique and top values, which requires scanning all entries.

3

The percentiles parameter accepts any list of floats between 0 and 1, allowing precise control over which quantiles to compute, but extreme percentiles can be less stable on small datasets.

When NOT to use

Describe() is not suitable when you need very detailed or custom statistics like skewness, kurtosis, or complex aggregations. In those cases, use pandas aggregation functions or specialized libraries like scipy or statsmodels.
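When only a few extra statistics are needed, one possible approach is to compute them with agg() and append them to the describe() output. The sketch below assumes a small invented DataFrame; the column names x and y are placeholders.

```python
import pandas as pd

# Hypothetical data: x has one large outlier, y is evenly spaced.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 100.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],
})

# Statistics that describe() does not report.
extra = df.agg(["skew", "kurt"])

# Stack them under the standard summary.
combined = pd.concat([df.describe(), extra])
print(combined)
```

The outlier in x shows up as a large positive skew, while the symmetric y column has a skew of about zero.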

Production Patterns

In real-world projects, describe() is often the first step in exploratory data analysis to quickly check data shape and quality. It is combined with visualization tools and custom aggregations to build a full understanding of datasets before modeling.

Connections

Exploratory Data Analysis (EDA)

Describe() is a core tool used in EDA to summarize data before deeper analysis.

Mastering describe() helps you quickly grasp data characteristics, making EDA faster and more effective.

Summary Statistics in Statistics

Describe() automates calculation of common summary statistics used in statistics.

Understanding describe() connects programming with statistical concepts like mean, median, and quartiles, bridging coding and theory.

Health Check Reports in Medicine

Describe() functions like a health check report, summarizing vital signs of data health.

Seeing describe() as a health check helps appreciate its role in spotting data issues early, similar to how doctors use reports to detect health problems.

Common Pitfalls

#1 Assuming describe() includes missing values in count.

Wrong approach: df.describe() # reading count as the total number of rows, NaNs included

Correct approach: df.describe() # reading count as the number of non-missing values only

Root cause: Believing that count covers all rows rather than only the values that are present.
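A quick way to check this is to compare count against the number of rows. The sketch below uses an invented temp column with two missing readings.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: 5 rows, 2 of them missing.
df = pd.DataFrame({"temp": [20.5, np.nan, 22.0, np.nan, 25.5]})

summary = df.describe()

n_rows = len(df)                           # all rows, NaN included
n_present = summary.loc["count", "temp"]   # non-missing values only
n_missing = n_rows - n_present

print(n_rows, n_present, n_missing)
```

A count lower than len(df) is often the first visible sign of missing data. The mean and the other statistics are computed from the present values only.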

#2 Expecting numeric statistics on text columns.

Wrong approach: df['text_column'].describe() # expecting mean or std

Correct approach: df['text_column'].describe() # shows count, unique, top, freq

Root cause: Not knowing describe() adapts output based on data type.
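To see the text summary in practice, the sketch below describes a made-up city column. The column names and values are illustrative only.

```python
import pandas as pd

# Hypothetical table with one text column and one numeric column.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Pune", "Oslo"],
    "sales": [10, 20, 30, 40, 50],
})

# A text column yields count, unique, top, and freq.
text_summary = df["city"].describe()
print(text_summary)

# By default, a DataFrame-level describe() keeps only numeric columns
# when the table contains at least one.
default_summary = df.describe()
print(list(default_summary.columns))
```

So a text column silently disappears from the default DataFrame summary unless you describe it directly or ask for it with the include parameter.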

#3 Not customizing percentiles when needed.

Wrong approach: df.describe() # default percentiles only

Correct approach: df.describe(percentiles=[0.1, 0.5, 0.9]) # custom percentiles

Root cause: Not knowing about the percentiles parameter, which limits insight into the tails of the distribution.
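A short sketch of the custom-percentiles call, assuming a hypothetical latency_ms series holding the integers 1 to 100:

```python
import pandas as pd

# Hypothetical data: the integers 1..100.
s = pd.Series(range(1, 101), name="latency_ms")

# Ask for the 10th, 50th, and 90th percentiles instead of the quartiles.
tails = s.describe(percentiles=[0.1, 0.5, 0.9])
print(tails)
```

The row labels follow the requested values, so this output has 10%, 50%, and 90% rows in place of the default 25%, 50%, and 75%.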

Key Takeaways

Describe() is a quick way to get important statistics about each column in your data.

It adapts its summary based on whether data is numeric or categorical, showing relevant stats for each.

Describe() ignores missing values in calculations but counts how many values are present.

You can customize which columns (via include and exclude) and which percentiles (via percentiles) describe() shows to fit your analysis needs.

Using describe() early helps catch data issues and understand your data's shape before deeper analysis.