
How to Use distinct in dplyr for Unique Rows in R

Use distinct() from the dplyr package to select unique rows from a data frame. You can name columns to get unique combinations of those columns, or omit the column arguments to drop rows that are duplicated across all columns.

Syntax

The basic syntax of distinct() is:

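  • distinct(.data, ...): Returns the unique rows of .data, a data frame or tibble.
  • ...: Optional columns that define uniqueness. If omitted, all columns are used.
  • .keep_all = FALSE: If TRUE, keeps all columns, not just those used for uniqueness. For each unique combination, the values come from its first row.
r
distinct(.data, ..., .keep_all = FALSE)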
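
Because the data frame is the first argument, distinct() also fits into a pipe. The following is a minimal sketch under assumed inputs: the tibble `orders` and its columns `customer` and `item` are hypothetical names made up for illustration.

```r
library(dplyr)

# Hypothetical data, for illustration only
orders <- tibble(
  customer = c("a", "a", "b"),
  item     = c("pen", "pen", "ink")
)

# The data frame is piped in as the first argument;
# the remaining arguments name the columns that define uniqueness
unique_orders <- orders %>% distinct(customer, item)

# Count the unique customer/item combinations
nrow(unique_orders)
```

The piped form is equivalent to calling distinct(orders, customer, item) directly; which one to use is a matter of style.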

Example

This example shows how to use distinct() to get unique rows based on one or more columns and how to keep all columns.

r
library(dplyr)

# Sample data frame
data <- tibble(
  id = c(1, 2, 2, 3, 3, 3),
  name = c("Alice", "Bob", "Bob", "Carol", "Carol", "Carol"),
  score = c(10, 20, 20, 30, 30, 40)
)

# Distinct rows based on all columns
distinct_all <- distinct(data)

# Distinct rows based on 'id' and 'name' only
distinct_id_name <- distinct(data, id, name)

# Distinct rows based on 'id' and 'name', keep all columns
distinct_keep_all <- distinct(data, id, name, .keep_all = TRUE)

list(distinct_all = distinct_all, distinct_id_name = distinct_id_name, distinct_keep_all = distinct_keep_all)
Output
$distinct_all
# A tibble: 4 × 3
     id name  score
  <dbl> <chr> <dbl>
1     1 Alice    10
2     2 Bob      20
3     3 Carol    30
4     3 Carol    40

$distinct_id_name
# A tibble: 3 × 2
     id name
  <dbl> <chr>
1     1 Alice
2     2 Bob
3     3 Carol

$distinct_keep_all
# A tibble: 3 × 3
     id name  score
  <dbl> <chr> <dbl>
1     1 Alice    10
2     2 Bob      20
3     3 Carol    30

Common Pitfalls

Common mistakes include:

  • Omitting columns when you want uniqueness based on specific columns. distinct() then compares entire rows, so rows that differ in any column are all kept.
  • Forgetting .keep_all = TRUE when you want to keep all columns but distinct by some columns.
  • Ignoring row order. distinct() keeps the first occurrence in the data's current order, so sort first (for example with arrange()) if a particular row should be the one kept.
r
library(dplyr)

data <- tibble(
  id = c(1, 2, 2, 3),
  score = c(10, 20, 30, 40)
)

# Wrong: distinct by 'id' but only returns 'id' column
wrong <- distinct(data, id)

# Right: distinct by 'id' but keep all columns
right <- distinct(data, id, .keep_all = TRUE)

list(wrong = wrong, right = right)
Output
$wrong
# A tibble: 3 × 1
     id
  <dbl>
1     1
2     2
3     3

$right
# A tibble: 3 × 2
     id score
  <dbl> <dbl>
1     1    10
2     2    20
3     3    40
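
The pitfall about row order can be illustrated with a short sketch. It reuses the data from the example above and assumes, purely for illustration, that the row to keep for each id is the one with the highest score. The name `highest_per_id` is made up here.

```r
library(dplyr)

data <- tibble(
  id = c(1, 2, 2, 3),
  score = c(10, 20, 30, 40)
)

# Sort so the preferred row (here, the highest score) comes first
# within each id, then let distinct() keep that first occurrence
highest_per_id <- data %>%
  arrange(id, desc(score)) %>%
  distinct(id, .keep_all = TRUE)

highest_per_id
```

Without the arrange() step, id 2 would keep the score 20 because that row appears first in the original data. With the sort, it keeps 30.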

Quick Reference

Summary tips for using distinct():

  • Use distinct(data) to remove duplicate rows entirely.
  • Use distinct(data, col1, col2) to get unique combinations of specific columns.
  • Use .keep_all = TRUE to keep all columns when selecting distinct rows by some columns.
  • Remember that distinct() keeps the first occurrence of each duplicate.

Key Takeaways

Use distinct() to select unique rows from a data frame in dplyr.
Specify columns inside distinct() to find unique combinations of those columns.
Use .keep_all = TRUE to keep all columns when filtering distinct rows by some columns.
distinct() keeps the first occurrence of duplicates and removes later ones.
Without column arguments, distinct() removes duplicate rows based on all columns.