Overview - Summary statistics

What is it?

Summary statistics are numbers that describe important features of a set of data. They help us understand the data by showing things like the average, the middle value, and how spread out the data is. In R, we use functions to quickly get these numbers from our data. This helps us see the big picture without looking at every single number.

Why it matters

Without summary statistics, we would have to look at every data point to understand what the data looks like, which is slow and confusing. Summary statistics give us a quick snapshot that helps us make decisions, find patterns, or spot problems. For example, knowing the average height in a group helps us understand the group better than looking at each person's height.

Where it fits

Before learning summary statistics, you should know how to work with basic data types and vectors in R. After this, you can learn about data visualization to see these statistics in graphs or move on to more advanced statistics like hypothesis testing and regression.

Mental Model

Core Idea

Summary statistics are simple numbers that capture the main story of a dataset so you don’t have to look at every detail.

Think of it like...

It's like reading the back cover of a book to get the main idea instead of reading every page.

┌─────────────────────────────┐
│        Data Set             │
│  [many numbers or values]   │
└─────────────┬───────────────┘
              │
              ▼
┌───────────────────────────────┐
│   Summary Statistics          │
│  Mean, Median, Min, Max,      │
│  Quartiles, Standard Deviation│
└───────────────────────────────┘

Build-Up - 7 Steps

1

Foundation: Understanding basic data vectors

Concept: Learn what a vector is in R and how data is stored.

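In R, data is often stored in vectors: ordered collections of values that all share the same type. For example, c(2, 4, 6, 8) is a numeric vector. You create vectors with the c() function. (R also has a separate list type that can mix types, which is why "vector" is the more precise word here.)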

Result

You can hold and access multiple numbers easily in one variable.

Understanding vectors is key because summary statistics work on these collections of numbers.
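
As a small illustration, the sketch below builds a vector and inspects it. The variable names and values are made up for this example.

```r
# Create a numeric vector with c() ("combine"); the data is hypothetical
heights <- c(150, 160, 165, 172, 180)

length(heights)   # number of elements: 5
heights[1]        # first element (R counts from 1): 150
heights[2:3]      # elements 2 through 3: 160 165
class(heights)    # "numeric"

# Mixing types forces every element to one common type
mixed <- c(1, "a", TRUE)
class(mixed)      # "character"
```

Because every element shares one type, functions such as mean() can treat the whole vector as a single batch of numbers.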

2

Foundation: Calculating mean and median
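
A minimal sketch of this step, using hypothetical scores with one unusually large value so the two measures of center visibly differ:

```r
scores <- c(2, 4, 6, 8, 100)   # hypothetical data

mean(scores)     # (2 + 4 + 6 + 8 + 100) / 5 = 24
median(scores)   # middle of the sorted values = 6

# With an even number of values, median() averages the two middle ones
median(c(2, 4, 6, 8))   # (4 + 6) / 2 = 5
```

The single value of 100 pulls the mean up to 24, while the median stays at 6.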

3

Intermediate: Finding spread with min, max, and range
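
A short sketch of this step with made-up temperature readings. Note that range() in R returns two numbers (the minimum and the maximum), not their difference.

```r
temps <- c(12, 18, 9, 21, 15)   # hypothetical temperatures

min(temps)           # 9
max(temps)           # 21
range(temps)         # 9 21  (a vector of length 2: min and max)
diff(range(temps))   # 12    (width of the range: max minus min)
```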

4

Intermediate: Using quantiles and interquartile range
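
One way to try this step is sketched below with the integers 1 to 9, chosen so the quartiles land on whole numbers. The results assume quantile()'s default interpolation method; R offers several others through its type argument, and they can give slightly different answers on small samples.

```r
x <- 1:9

q <- quantile(x)             # 0%, 25%, 50%, 75%, 100%: 1 3 5 7 9
q

quantile(x, probs = 0.9)     # 90th percentile: 8.2
IQR(x)                       # third quartile minus first quartile: 7 - 3 = 4
```

The interquartile range covers the middle half of the data, so it is not affected by a few extreme values at either end.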

5

Intermediate: Measuring variability with standard deviation
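
The sketch below compares two hypothetical vectors that share the same mean but differ in spread. It relies on sd() computing the sample standard deviation, which divides by n - 1.

```r
tight <- c(48, 49, 50, 51, 52)   # values close to the mean
wide  <- c(10, 30, 50, 70, 90)   # values far from the mean

mean(tight)   # 50
mean(wide)    # 50

var(tight)    # sample variance: 10 / 4 = 2.5
sd(tight)     # sqrt(2.5), about 1.58
sd(wide)      # sqrt(1000), about 31.62
```

Both vectors have the same center, yet sd() shows that the second one is far more spread out.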

6

Advanced: Using summary() for quick overview
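
A minimal sketch of this step. The vector and the data frame are invented for illustration; only summary(), names(), and data.frame() from base R are used.

```r
x <- c(1, 2, 3, 4, 100)   # hypothetical data

s <- summary(x)
s            # prints six statistics in one call
names(s)     # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
s[["Mean"]]  # individual entries can be pulled out by name: 22

# summary() also works on a whole data frame, one column at a time
df <- data.frame(height = c(150, 160, 170), weight = c(55, 62, 70))
summary(df)
```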

7

Expert: Handling missing data in summary statistics
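
A small sketch of this step with a hypothetical vector containing two missing values. It shows the default behavior, the na.rm argument, and a way to count what is missing before deciding to drop it.

```r
x <- c(4, 8, NA, 6, NA)   # hypothetical data with missing values

mean(x)                  # NA: a single missing value makes the result unknown
mean(x, na.rm = TRUE)    # 6 (NA values are dropped first)

sum(is.na(x))            # 2   (how many values are missing)
mean(is.na(x))           # 0.4 (share of values that are missing)

summary(x)               # includes an extra "NA's" entry with the count
```

Checking how much is missing first helps you judge whether dropping the NA values is reasonable for your data.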

Under the Hood

Summary statistics functions in R work by scanning through the data vector and performing simple calculations such as addition, sorting, or counting. For example, mean() adds all the numbers and divides by the count; median() orders the data and picks the middle value (or the average of the two middle values when the count is even); and sd() takes the square root of the sample variance, which is the sum of squared differences from the mean divided by n - 1. When missing values (NA) are present, mean(), median(), and sd() return NA unless you pass na.rm = TRUE, which drops the NA values first.
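
To make these calculations concrete, the sketch below redoes them by hand and compares the results with the built-in functions. It is only an illustration of the arithmetic, not R's actual internal code; the variable names are made up, and the manual median assumes an odd number of values. Note that the standard deviation is the square root of the sum of squared differences divided by n - 1.

```r
x <- c(2, 4, 4, 5, 10)   # hypothetical data, odd length
n <- length(x)

manual_mean   <- sum(x) / n                                 # 25 / 5 = 5
manual_median <- sort(x)[(n + 1) / 2]                       # middle value = 4
manual_sd     <- sqrt(sum((x - manual_mean)^2) / (n - 1))   # sqrt(36 / 4) = 3

all.equal(manual_mean, mean(x))       # TRUE
all.equal(manual_median, median(x))   # TRUE
all.equal(manual_sd, sd(x))           # TRUE
```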

Why designed this way?

R was designed for statistical computing, so these functions are built to be fast and simple for common tasks. The choice to return NA by default when missing data exists forces users to consciously handle missing values, preventing silent errors. The summary() function bundles many statistics to save time and reduce repetitive code.

┌───────────────┐
│   Data Vector │
│ [values + NA] │
└───────┬───────┘
        │
        ▼
┌─────────────────────────────┐
│ Summary Functions (mean, sd,│
│ median, min, max, quantile) │
└───────┬─────────┬───────────┘
        │         │
        ▼         ▼
┌─────────────┐ ┌─────────────┐
│ Handle NA?  │ │ Calculate   │
│ na.rm=TRUE? │ │ Statistic   │
└─────┬───────┘ └─────┬───────┘
      │               │
      ▼               ▼
┌─────────────┐ ┌─────────────┐
│ Skip NA     │ │ Return Value│
│ or Return NA│ │ (number)    │
└─────────────┘ └─────────────┘

Myth Busters - 4 Common Misconceptions

Quick: Does mean() ignore missing values by default in R? Commit to yes or no.

Common Belief: mean() automatically ignores missing values (NA) without extra instructions.

Reality: mean() returns NA if the vector contains any NA. Pass na.rm = TRUE to drop missing values before averaging.

Quick: Is the median always the same as the mean? Commit to yes or no.

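Common Belief: Median and mean are the same because both measure the center of data.

Reality: The mean and median coincide only when the data is roughly symmetric. With skew or outliers they can differ a lot: for c(1, 2, 3, 100) the mean is 26.5 and the median is 2.5.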

Quick: Does summary() only give mean and median? Commit to yes or no.

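Common Belief: summary() only returns mean and median values.

Reality: For a numeric vector, summary() returns the minimum, first quartile, median, mean, third quartile, and maximum, plus a count of NA values when any are present.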

Quick: Does a higher standard deviation mean data points are closer to the mean? Commit to yes or no.

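Common Belief: Higher standard deviation means data points are closer to the average.

Reality: A higher standard deviation means data points lie, on average, farther from the mean. Values clustered near the mean give a low standard deviation.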

Expert Zone

1

Summary statistics can be misleading if data is heavily skewed or has many outliers; always check data shape before trusting them.

2

The choice between mean and median depends on data distribution; median is more robust to outliers.

3

Handling missing data properly is crucial; ignoring NA without understanding can bias your results.

When NOT to use

Summary statistics are not enough when you need to understand relationships between variables or test hypotheses. Use inferential statistics, regression, or visualization instead.

Production Patterns

In real-world data analysis, summary statistics are used as the first step in data cleaning and exploration. Automated reports often include summary() outputs. In production code, handling missing data explicitly and choosing robust statistics like median or trimmed mean is common.
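
One possible shape for such production code is sketched below. The helper name robust_summary and the 10% trim are illustrative assumptions, not an established convention; the base R pieces it relies on are mean()'s trim argument, median(), and mad().

```r
# Hypothetical helper: counts missing values explicitly, then reports
# statistics that are less sensitive to outliers than the plain mean.
robust_summary <- function(x, trim = 0.1) {
  stopifnot(is.numeric(x))
  n_missing <- sum(is.na(x))
  x <- x[!is.na(x)]                       # drop NA on purpose, after counting
  c(
    n         = length(x),
    n_missing = n_missing,
    median    = median(x),
    trimmed   = mean(x, trim = trim),     # drops a share of each tail first
    mad       = mad(x)                    # median absolute deviation (scaled)
  )
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1000, NA)   # hypothetical data
mean(x, na.rm = TRUE)   # 104.5, dominated by the value 1000
robust_summary(x)       # median and trimmed mean are both 5.5
```

Reporting n_missing alongside the statistics keeps the decision to drop NA values visible in the output rather than hidden inside the function.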

Connections

Data visualization

Builds-on

Knowing summary statistics helps you interpret graphs like boxplots and histograms, which visually show the same data features.
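
As a rough illustration of that link, the sketch below computes the numbers a boxplot is built from without drawing anything. The data is hypothetical; fivenum() and boxplot.stats() ship with a standard R installation.

```r
x <- c(2, 3, 4, 5, 6, 7, 8, 30)   # hypothetical data; 30 sits far from the rest

# Tukey's five numbers: min, lower hinge, median, upper hinge, max
fivenum(x)             # 2.0 3.5 5.5 7.5 30.0

# Points a boxplot would draw individually as outliers
boxplot.stats(x)$out   # 30

# boxplot(x) would draw the box from these same numbers
```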

Descriptive statistics in psychology

Same pattern

Summary statistics in R follow the same principles psychologists use to describe test scores, showing how programming supports real-world research.

Executive summaries in business

Builds-on

Just like summary statistics condense data, executive summaries condense reports; both help busy people grasp key points quickly.

Common Pitfalls

#1 Ignoring missing data causes wrong results.

Wrong approach:
x <- c(1, 2, NA, 4)
mean(x)  # returns NA

Correct approach:
x <- c(1, 2, NA, 4)
mean(x, na.rm = TRUE)  # returns 2.333

Root cause: Not knowing that mean() returns NA if any missing values exist unless told to remove them.

#2 Using mean to describe skewed data misleads about typical values.

Wrong approach:
x <- c(1, 2, 3, 100)
mean(x)  # 26.5 (not typical)

Correct approach:
x <- c(1, 2, 3, 100)
median(x)  # 2.5 (better typical value)

Root cause: Assuming mean always represents the center without checking data shape.

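#3 Confusing range with standard deviation.

Wrong approach:
x <- c(1, 2, 3, 4, 5)
range(x)  # 1 5
sd(x)     # 1.58
# Treating the range as a measure of typical spread, like sd

Correct approach: Use range() to see the minimum and maximum, and use sd() to measure typical spread around the mean.

Root cause: Not understanding that the range depends only on the two extremes (in R, range() returns the minimum and maximum, and diff(range(x)) gives their difference), while sd() reflects how far all values typically lie from the mean.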
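
The difference can be seen in a small sketch with two hypothetical vectors that have the same minimum and maximum but different standard deviations:

```r
a <- c(1, 5, 5, 5, 9)   # most values sit at the center
b <- c(1, 1, 5, 9, 9)   # most values sit at the extremes

diff(range(a))   # 8
diff(range(b))   # 8 (same extremes, so same range width)

sd(a)            # sqrt(8), about 2.83
sd(b)            # 4
```

The range only looks at the two end points, so it cannot tell these vectors apart; sd() uses every value and can.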

Key Takeaways

Summary statistics give simple numbers that describe the center, spread, and shape of data.

Mean and median both measure center but behave differently with skewed data or outliers.

Functions like mean(), median(), sd(), and summary() in R help calculate these statistics quickly.

Missing data must be handled explicitly to avoid incorrect results.

Summary statistics are the foundation for understanding data before deeper analysis or visualization.