How to Use distinct in dplyr for Unique Rows in R
Use distinct() from the dplyr package to select unique rows from a data frame. You can specify columns to find unique combinations of their values, or call it with no column arguments to drop rows that are duplicated across all columns.
Syntax
The basic syntax of distinct() is:
r
distinct(data, ..., .keep_all = FALSE)
- data: The data frame or tibble to return unique rows from (this argument is named .data in dplyr).
- ...: Optional columns to consider for uniqueness. If omitted, all columns are used.
- .keep_all = FALSE: If TRUE, keeps all columns, not just those used for uniqueness.
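Because the data frame is the first argument, distinct() can also be called through the pipe. The sketch below assumes a small made-up tibble named df (hypothetical, not part of this article's examples) and only illustrates that the two call styles should give the same result.

```r
library(dplyr)

# Hypothetical example data, used only for this illustration
df <- tibble(
  group = c("a", "a", "b"),
  value = c(1, 1, 2)
)

# Direct call: the data frame is passed as the first argument
direct <- distinct(df, group)

# Piped call: the pipe supplies df as the first argument
piped <- df %>% distinct(group)

# Compare the two results
all.equal(direct, piped)
```

Either style works; the piped form is usually easier to read when distinct() is one step in a longer chain.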
Example
This example shows how to use distinct() to get unique rows based on one or more columns and how to keep all columns.
r
library(dplyr)

# Sample data frame
data <- tibble(
  id = c(1, 2, 2, 3, 3, 3),
  name = c("Alice", "Bob", "Bob", "Carol", "Carol", "Carol"),
  score = c(10, 20, 20, 30, 30, 40)
)

# Distinct rows based on all columns
distinct_all <- distinct(data)

# Distinct rows based on 'id' and 'name' only
distinct_id_name <- distinct(data, id, name)

# Distinct rows based on 'id' and 'name', keep all columns
distinct_keep_all <- distinct(data, id, name, .keep_all = TRUE)

list(
  distinct_all = distinct_all,
  distinct_id_name = distinct_id_name,
  distinct_keep_all = distinct_keep_all
)
Output
distinct_all
# A tibble: 4 × 3
     id name  score
  <dbl> <chr> <dbl>
1     1 Alice    10
2     2 Bob      20
3     3 Carol    30
4     3 Carol    40
distinct_id_name
# A tibble: 3 × 2
     id name
  <dbl> <chr>
1     1 Alice
2     2 Bob
3     3 Carol
distinct_keep_all
# A tibble: 3 × 3
     id name  score
  <dbl> <chr> <dbl>
1     1 Alice    10
2     2 Bob      20
3     3 Carol    30
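One way to sanity-check results like these is to compare row counts before and after calling distinct(). The following is a minimal sketch that rebuilds the same sample data; the variable names n_total, n_unique_rows, n_unique_pairs, and n_removed are introduced here for illustration only.

```r
library(dplyr)

# Same sample data as in the example above
data <- tibble(
  id = c(1, 2, 2, 3, 3, 3),
  name = c("Alice", "Bob", "Bob", "Carol", "Carol", "Carol"),
  score = c(10, 20, 20, 30, 30, 40)
)

# Rows in the original data
n_total <- nrow(data)

# Rows left after dropping fully duplicated rows
n_unique_rows <- nrow(distinct(data))

# Unique combinations of id and name
n_unique_pairs <- nrow(distinct(data, id, name))

# Number of rows removed as exact duplicates
n_removed <- n_total - n_unique_rows

c(total = n_total, unique_rows = n_unique_rows,
  unique_pairs = n_unique_pairs, removed = n_removed)
```

Here the six input rows reduce to four fully unique rows and three unique id and name pairs, so two rows are dropped as exact duplicates.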
Common Pitfalls
Common mistakes include:
- Not specifying columns when you want uniqueness based on specific columns. With no columns given, distinct() compares entire rows, so rows that share a key but differ in another column are all kept.
- Forgetting .keep_all = TRUE when you want to keep all columns but distinct by some columns.
- Using distinct() on unsorted data when order matters, as it keeps the first occurrence.
r
library(dplyr)

data <- tibble(
  id = c(1, 2, 2, 3),
  score = c(10, 20, 30, 40)
)

# Wrong: distinct by 'id' but only returns 'id' column
wrong <- distinct(data, id)

# Right: distinct by 'id' but keep all columns
right <- distinct(data, id, .keep_all = TRUE)

list(wrong = wrong, right = right)
Output
wrong
# A tibble: 3 × 1
     id
  <dbl>
1     1
2     2
3     3
right
# A tibble: 3 × 2
     id score
  <dbl> <dbl>
1     1    10
2     2    20
3     3    40
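The ordering pitfall can be handled by sorting before calling distinct(). The sketch below assumes the goal is to keep the highest score for each id, which is an illustrative choice rather than something the example above requires; it uses arrange() and desc() from dplyr so that the row to keep comes first within each id.

```r
library(dplyr)

# Same data as in the pitfall example above
data <- tibble(
  id = c(1, 2, 2, 3),
  score = c(10, 20, 30, 40)
)

# Sort so the highest score for each id comes first,
# then keep the first row per id along with all columns
highest <- data %>%
  arrange(id, desc(score)) %>%
  distinct(id, .keep_all = TRUE)

highest
```

Without the arrange() step, id 2 keeps its first row (score 20); with it, id 2 keeps the row with score 30.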
Quick Reference
Summary tips for using distinct():
- Use distinct(data) to remove duplicate rows entirely.
- Use distinct(data, col1, col2) to get unique combinations of specific columns.
- Use .keep_all = TRUE to keep all columns when selecting distinct rows by some columns.
- Remember it keeps the first occurrence of duplicates.
Key Takeaways
Use distinct() to select unique rows from a data frame in dplyr.
Specify columns inside distinct() to find unique combinations of those columns.
Use .keep_all = TRUE to keep all columns when filtering distinct rows by some columns.
distinct() keeps the first occurrence of duplicates and removes later ones.
Without arguments, distinct() removes duplicate rows based on all columns.