How to Use summarize in dplyr: Simple Guide with Examples
In dplyr, use summarize() to compute summary statistics by applying functions like mean() or sum() to columns. Used on its own, it reduces the data to a single row; used after group_by(), it returns one row per group.
Syntax
The basic syntax of summarize() is:
summarize(data, new_column = summary_function(column))
- data: your data frame or tibble
- new_column: name for the summary result
- summary_function(column): a function like mean(), sum(), n(), etc.
When combined with group_by(), it summarizes data by groups.
r
library(dplyr)

# Basic syntax
summarize(data, new_column = mean(column))
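The syntax above passes the data frame as the first argument, while the examples in this guide use the pipe. As a minimal sketch of how the two forms relate, the block below runs both on the built-in mtcars dataset. The names direct, piped, avg_mpg, and n_cars are illustrative and not part of dplyr.

```r
library(dplyr)

# Direct call: the data frame is the first argument
direct <- summarize(mtcars, avg_mpg = mean(mpg), n_cars = n())

# Piped call: %>% passes mtcars in as the first argument
piped <- mtcars %>% summarize(avg_mpg = mean(mpg), n_cars = n())

# Compare the two one-row results
all.equal(direct, piped)
```

Both calls compute the same summaries, so the choice between them is mostly a matter of readability when several steps are chained.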
Example
This example shows how to calculate the average miles per gallon (mpg) for each number of cylinders (cyl) in the built-in mtcars dataset.
r
library(dplyr)
result <- mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))
print(result)
Output
  cyl  avg_mpg
1   4 26.66364
2   6 19.74286
3   8 15.10000
Common Pitfalls
Common mistakes include:
- Forgetting to use group_by() when you want summaries by group, which results in a single summary for the whole dataset.
- Calling group_by() on columns without following it with summarize(), which only marks the groups and does not compute any summary.
- Not loading the dplyr library before using summarize().
r
library(dplyr)

# Wrong: no group_by, so only one summary
mtcars %>% summarize(avg_mpg = mean(mpg))

# Right: with group_by to get group summaries
mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))
Output
   avg_mpg
1 20.09062

  cyl  avg_mpg
1   4 26.66364
2   6 19.74286
3   8 15.10000
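The pitfall of grouping without summarizing can be checked directly. The sketch below, again assuming the built-in mtcars dataset, compares row counts before and after summarize(). The variable names grouped and summarized are illustrative.

```r
library(dplyr)

# group_by() alone only marks the groups; no rows are collapsed
grouped <- mtcars %>% group_by(cyl)
nrow(grouped)            # same number of rows as mtcars
is_grouped_df(grouped)   # reports whether the data carries grouping

# Adding summarize() collapses each group to a single row
summarized <- grouped %>% summarize(avg_mpg = mean(mpg))
nrow(summarized)         # one row per distinct cyl value
```

If a result has as many rows as the input, that is usually a sign that the summarize() step is missing.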
Quick Reference
| Function | Description | Example |
|---|---|---|
| mean() | Calculates average | summarize(data, avg = mean(column)) |
| sum() | Calculates total sum | summarize(data, total = sum(column)) |
| n() | Counts rows | summarize(data, count = n()) |
| median() | Calculates median | summarize(data, med = median(column)) |
| max() | Finds maximum value | summarize(data, max_val = max(column)) |
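The functions in the table can be combined in a single summarize() call. As a rough sketch on the built-in mtcars dataset, the block below computes all five per cylinder group. The output column names and the choice of the mpg and hp columns are illustrative assumptions.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    avg_mpg  = mean(mpg),    # average miles per gallon
    total_hp = sum(hp),      # total horsepower in the group
    count    = n(),          # number of rows in the group
    med_mpg  = median(mpg),  # median miles per gallon
    max_mpg  = max(mpg)      # highest miles per gallon
  )
print(by_cyl)
```

Each named argument becomes one column in the result, next to the grouping column cyl.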
Key Takeaways
- Use summarize() to create summary statistics from data frames.
- Combine summarize() with group_by() to get summaries by groups.
- Always load the dplyr package before using summarize().
- Common summary functions include mean(), sum(), and n().
- Without group_by(), summarize() returns one summary for the entire dataset.