SciPy · Data · ~15 mins

Descriptive statistics (describe) in SciPy - Deep Dive

Overview - Descriptive statistics (describe)
What is it?
Descriptive statistics summarize and describe the main features of a dataset using a handful of numbers. The describe function in scipy.stats quickly calculates key statistics: count, min/max, mean, variance, skewness, and kurtosis. These summaries help us understand the data's shape, center, and spread without looking at every value. It is a simple way to get a snapshot of the data.
Why it matters
Without descriptive statistics, we would struggle to understand large datasets quickly. Imagine trying to analyze thousands of numbers without any summary; it would be overwhelming and error-prone. Descriptive statistics give us clear insights to make decisions, spot errors, or prepare data for further analysis. They are the foundation for all data science work.
Where it fits
Before learning descriptive statistics, you should know basic Python and how to handle data arrays. After mastering descriptive statistics, you can move on to data visualization and inferential statistics, which build on these summaries to make predictions or test hypotheses.
Mental Model
Core Idea
Descriptive statistics are like a quick summary report that tells you the main story of your data without reading every detail.
Think of it like...
It's like reading the back cover of a book to get the main idea before deciding to read the whole story.
┌───────────────────────────────┐
│           Dataset             │
├─────────────┬─────────────────┤
│ Raw Data    │ [values, ...]   │
├─────────────┴─────────────────┤
│      describe() function      │
├─────────────┬─────────────────┤
│ Output      │ Summary stats   │
│             │ - count (nobs)  │
│             │ - min, max      │
│             │ - mean          │
│             │ - variance      │
│             │ - skewness      │
│             │ - kurtosis      │
└─────────────┴─────────────────┘
Build-Up - 6 Steps
1
Foundation: Understanding basic statistics terms
🤔
Concept: Introduce simple terms like mean, median, variance, and percentiles.
Mean is the average of numbers. Median is the middle value when data is sorted. Variance measures how spread out the data is. Percentiles show the value below which a certain percentage of data falls.
Result
Learners can explain what each basic statistic means in simple words.
Knowing these terms is essential because they form the building blocks of all descriptive statistics.
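These terms can be computed directly with NumPy; a minimal sketch (the sample values are made up):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = np.mean(data)             # average: sum divided by count
median = np.median(data)         # middle value of the sorted data
variance = np.var(data, ddof=1)  # sample variance: spread around the mean
p25 = np.percentile(data, 25)    # 25% of values fall at or below this

print(mean, median, round(variance, 3), p25)  # 5.0 4.5 4.571 4.0
```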
2
Foundation: Getting started with scipy.stats.describe
🤔
Concept: Learn how to use scipy.stats.describe to get descriptive statistics from data.
Import scipy.stats and call describe() on a list or array of numbers. It returns count, min/max, mean, variance, skewness, and kurtosis all at once.
Result
Output is a DescribeResult object with all key statistics.
Using one function to get many statistics saves time and reduces errors compared to calculating each manually.
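A minimal usage sketch (the data values are made up):

```python
import numpy as np
from scipy.stats import describe

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
result = describe(data)

print(result.nobs)      # 8 -- number of observations
print(result.minmax)    # (min, max) tuple
print(result.mean)      # 5.0
print(result.variance)  # sample variance (ddof=1)
print(result.skewness)
print(result.kurtosis)  # Fisher (excess) kurtosis
```

The result is a named tuple, so the fields can also be unpacked in one line: `nobs, minmax, mean, var, skew, kurt = result`.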
3
Intermediate: Interpreting the describe output
🤔 Before reading on: do you think the variance returned by describe is the sample variance or population variance? Commit to your answer.
Concept: Understand what each value in the output means and how to interpret it.
The output includes nobs (number of observations), minmax (the min and max values), mean (average), variance (sample variance by default), and always skewness and kurtosis. Skewness shows data asymmetry; kurtosis shows tail heaviness.
Result
Learners can read the output and explain what it says about the data's shape and spread.
Knowing the difference between sample and population variance prevents wrong conclusions about data variability.
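The sample-vs-population distinction can be seen concretely (the values are made up):

```python
import numpy as np
from scipy.stats import describe

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
result = describe(data)

n = result.nobs
sample_var = result.variance     # divides by n - 1 (ddof=1)
population_var = np.var(data)    # divides by n

print(sample_var)                # 2.5
print(population_var)            # 2.0
print(sample_var * (n - 1) / n)  # 2.0 -- sample -> population conversion
```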
4
Intermediate: Handling multidimensional data
🤔 Before reading on: do you think describe works on each column separately or on the whole array flattened? Commit to your answer.
Concept: Learn how describe handles 2D or higher arrays and which axis the statistics run along.
By default, describe computes statistics along axis=0, so a 2D array yields one value per column. To treat all values as a single dataset, pass axis=None; for richer per-column tables, pandas describe is often more convenient.
Result
Learners understand the axis parameter and can choose between per-column and whole-array summaries.
Knowing which axis is summarized helps avoid misinterpretation when working with datasets that have multiple features.
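scipy.stats.describe defaults to axis=0; a quick check on a small 2D array (values made up):

```python
import numpy as np
from scipy.stats import describe

data = np.array([[1, 2],
                 [3, 4],
                 [5, 6]])

per_column = describe(data)            # axis=0 is the default: one stat per column
print(per_column.mean)                 # [3. 4.]

flattened = describe(data, axis=None)  # treat all six values as one dataset
print(flattened.mean)                  # 3.5
```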
5
Advanced: Customizing describe with additional options
🤔 Before reading on: do you think describe calculates skewness and kurtosis by default? Commit to your answer.
Concept: Explore optional parameters such as ddof, bias correction, and NaN handling.
The ddof parameter sets the variance divisor (n - ddof; the default ddof=1 gives the unbiased sample variance). The bias parameter controls whether skewness and kurtosis use biased or bias-corrected estimates. The nan_policy parameter controls how missing values are handled ('propagate', 'omit', or 'raise'). Skewness and kurtosis are always included in the output.
Result
Learners can customize describe to fit their data quality and analysis needs.
Understanding these options allows more accurate statistics especially with small or imperfect datasets.
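A short sketch of these options (the data is made up):

```python
import numpy as np
from scipy.stats import describe

data = np.array([1.0, 2.0, np.nan, 4.0])

# Default nan_policy='propagate': any NaN poisons the statistics
print(describe(data).mean)                     # nan

# nan_policy='omit': drop NaNs before computing
print(describe(data, nan_policy='omit').mean)  # 2.333...

clean = np.array([1.0, 2.0, 3.0, 4.0])
print(describe(clean, ddof=0).variance)        # 1.25 -- population variance
print(describe(clean, bias=False).skewness)    # 0.0 -- symmetric data
```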
6
Expert: Limitations and alternatives to scipy.stats.describe
🤔 Before reading on: do you think scipy.describe is the best tool for all descriptive statistics needs? Commit to your answer.
Concept: Learn when describe is not enough and what other tools or libraries to use.
Describe is fast and simple but limited for complex data types or grouped statistics. Pandas DataFrame.describe() offers richer summaries per column with better handling of missing data. For very large datasets, specialized libraries or incremental statistics may be better.
Result
Learners know when to switch tools for better analysis.
Knowing the tool's limits prevents misuse and encourages choosing the right tool for the job.
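For contrast, a small pandas sketch (the column names and values are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'height': [150.0, 160.0, 170.0],
                   'weight': [50.0, np.nan, 70.0]})

# Per-column summary: count, mean, std, min, quartiles, max.
# NaNs are dropped column by column, unlike scipy's default behavior.
summary = df.describe()
print(summary)
print(summary.loc['count', 'weight'])  # 2.0 -- the NaN was excluded
```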
Under the Hood
The describe function computes statistics by iterating over the data array once or twice. It calculates count by counting elements, mean by summing and dividing, variance by summing squared differences from the mean, and min/max by comparing values. Skewness and kurtosis are calculated using formulas involving central moments. It uses efficient C-backed numpy operations internally for speed.
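This can be mirrored by hand; a sketch (assuming the defaults bias=True and ddof=1) that recomputes the same quantities and compares them with describe's output:

```python
import numpy as np
from scipy.stats import describe

data = np.array([2.0, 4.0, 4.0, 5.0, 7.0])

n = data.size
mean = data.sum() / n
var = ((data - mean) ** 2).sum() / (n - 1)  # ddof=1 sample variance
m2 = ((data - mean) ** 2).mean()            # central moments
m3 = ((data - mean) ** 3).mean()
m4 = ((data - mean) ** 4).mean()
skewness = m3 / m2 ** 1.5                   # biased estimate (bias=True default)
kurtosis = m4 / m2 ** 2 - 3                 # Fisher: excess over the normal's 3

result = describe(data)
print(np.isclose(result.variance, var),
      np.isclose(result.skewness, skewness),
      np.isclose(result.kurtosis, kurtosis))  # True True True
```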
Why designed this way?
Describe was designed to provide a quick, all-in-one summary to avoid repetitive code and reduce errors. It balances speed and completeness by including common statistics in one call. Alternatives like pandas describe came later to handle tabular data better, but scipy.describe remains useful for simple arrays.
┌──────────────────────────────┐
│ Input Array                  │
├──────────────────────────────┤
│ Count values                 │
│ Calculate sum                │
│ Calculate sum of squares     │
│ Find min, max                │
│ Calculate skewness, kurtosis │
├──────────────────────────────┤
│ Package results              │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ DescribeResult               │
│ (count, mean, variance, min, │
│  max, skewness, kurtosis)    │
└──────────────────────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does scipy.describe return population variance by default? Commit yes or no.
Common Belief: scipy.describe returns the population variance by default.
Reality: It returns the sample variance by default, which divides by (n-1), not n.
Why it matters: Using population variance instead of sample variance can underestimate variability, leading to wrong conclusions, especially with small samples.
Quick: Does describe handle missing values automatically? Commit yes or no.
Common Belief: describe automatically ignores missing values (NaNs) in the data.
Reality: By default (nan_policy='propagate'), describe does not ignore NaNs and returns NaN for the statistics if any NaN is present; set nan_policy='omit' to drop them.
Why it matters: Ignoring this causes misleading statistics or errors when data has missing values.
Quick: Does describe work on each column separately for 2D arrays? Commit yes or no.
Common Belief: describe flattens a 2D array and treats all values as one dataset.
Reality: describe computes statistics along axis=0 by default, i.e. separately for each column; pass axis=None to flatten the array.
Why it matters: Assuming the wrong axis behavior causes confusion and incorrect analysis when working with tabular data.
Quick: Are skewness and kurtosis always included in describe output? Commit yes or no.
Common Belief: Skewness and kurtosis are optional extras that must be requested explicitly.
Reality: They are always computed and included in the result; there is no parameter to switch them off (the bias parameter only changes how they are estimated). They also require enough data points to be meaningful.
Why it matters: Assuming they are absent, or trusting them on tiny samples, can lead to misinterpretation of data shape.
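All four cards can be verified in a few lines:

```python
import numpy as np
from scipy.stats import describe

x = np.array([1.0, 2.0, 3.0, 4.0])

# Card 1: variance is the sample variance (divide by n-1), not population
assert np.isclose(describe(x).variance, np.var(x, ddof=1))

# Card 2: NaNs propagate into the result unless nan_policy='omit' is set
assert np.isnan(describe(np.array([1.0, np.nan, 3.0])).mean)

# Card 3: 2D input is summarized per column (axis=0), not flattened
two_d = np.array([[1.0, 10.0], [3.0, 30.0]])
assert np.allclose(describe(two_d).mean, [2.0, 20.0])

# Card 4: skewness and kurtosis are always present in the result
r = describe(x)
assert hasattr(r, 'skewness') and hasattr(r, 'kurtosis')

print('all four checks pass')
```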
Expert Zone
1
The variance returned is the unbiased sample variance, which is important for statistical inference but can confuse beginners expecting population variance.
2
The skewness and kurtosis calculations use Fisher's definition (excess kurtosis), which centers around zero for normal distributions, a subtlety often missed.
3
The nan_policy parameter allows control over missing data handling, but improper use can silently produce wrong statistics.
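The Fisher (excess) convention mentioned above can be checked empirically on simulated normal data (the seed is chosen arbitrarily):

```python
import numpy as np
from scipy.stats import describe

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)

# Excess kurtosis of a normal distribution is ~0 under Fisher's
# definition (it would be ~3 under Pearson's)
k = describe(normal_sample).kurtosis
print(round(k, 3))  # close to 0.0
assert abs(k) < 0.2
```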
When NOT to use
Do not use scipy.describe for complex datasets with multiple features or missing data; instead, use pandas DataFrame.describe() or specialized libraries like statsmodels. For streaming or very large data, incremental statistics tools are better.
Production Patterns
In production, scipy.describe is often used for quick sanity checks on numeric arrays before deeper analysis. Data engineers use it in pipelines to validate data quality. Data scientists switch to pandas for richer summaries and visualization-ready outputs.
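A hypothetical sanity-check helper in this spirit (the function name, thresholds, and checks are invented for illustration):

```python
import numpy as np
from scipy.stats import describe

def sanity_check(batch, max_abs=1e6):
    """Flag suspicious numeric batches before deeper analysis."""
    stats = describe(batch, axis=None, nan_policy='omit')
    problems = []
    if stats.nobs < np.ravel(batch).size:   # omit dropped some values
        problems.append('contains NaNs')
    lo, hi = stats.minmax
    if abs(lo) > max_abs or abs(hi) > max_abs:
        problems.append('extreme values')
    if stats.variance == 0:
        problems.append('constant values')
    return problems

print(sanity_check(np.array([1.0, 2.0, np.nan, 4.0])))  # ['contains NaNs']
print(sanity_check(np.array([5.0, 5.0, 5.0])))          # ['constant values']
```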
Connections
Pandas DataFrame.describe()
builds-on
Understanding scipy.describe helps grasp pandas describe, which extends the idea to tabular data with better handling of columns and missing values.
Inferential statistics
foundation for
Descriptive statistics provide the necessary summaries that inferential statistics use to make predictions or test hypotheses.
Summary reports in business
same pattern
Just like descriptive statistics summarize data, business summary reports condense complex information into key points for quick decisions.
Common Pitfalls
#1 Assuming describe handles missing values automatically.
Wrong approach:
from scipy.stats import describe
import numpy as np

data = np.array([1, 2, np.nan, 4])
result = describe(data)
print(result)  # mean, variance, etc. come back as nan

Correct approach:
from scipy.stats import describe
import numpy as np

data = np.array([1, 2, np.nan, 4])
result = describe(data, nan_policy='omit')  # drop NaNs before computing
print(result)

Root cause: Not realizing that describe propagates NaNs by default leads to NaN results or errors.
#2 Expecting a single overall summary from describe on 2D data.
Wrong approach:
from scipy.stats import describe
import numpy as np

data = np.array([[1, 2], [3, 4]])
result = describe(data)
print(result.mean)  # [2. 3.] -- per-column means, not one overall mean

Correct approach:
from scipy.stats import describe
import numpy as np

data = np.array([[1, 2], [3, 4]])
print(describe(data).mean)             # per-column statistics (axis=0 default)
print(describe(data, axis=None).mean)  # 2.5 -- whole array as one dataset

Root cause: describe defaults to axis=0, so 2D input is summarized per column; overlooking this causes wrong assumptions about the output.
#3 Confusing sample variance with population variance.
Wrong approach:
from scipy.stats import describe
import numpy as np

data = np.array([1, 2, 3, 4, 5])
result = describe(data)
print('Variance:', result.variance)  # mistakenly treated as population variance

Correct approach:
from scipy.stats import describe
import numpy as np

data = np.array([1, 2, 3, 4, 5])
result = describe(data)
n = result.nobs
print('Sample variance:', result.variance)
print('Population variance:', result.variance * (n - 1) / n)

Root cause: Lack of clarity about which variance describe returns leads to incorrect interpretation of data spread.
Key Takeaways
Descriptive statistics provide a quick summary of data using key numbers like mean, variance, and percentiles.
The scipy.stats.describe function calculates many statistics at once, saving time and reducing errors.
By default, describe returns sample variance and does not ignore missing values unless specified.
Describe summarizes along axis=0 by default, giving per-column statistics for 2D arrays; pass axis=None to summarize the whole array at once.
Knowing the limits of describe helps you choose better tools like pandas for complex or tabular data.