How to Use full_join in dplyr for Combining Data Frames
Use full_join() from the dplyr package to merge two data frames by one or more keys, keeping all rows from both tables. Rows without matches will have NA in the unmatched columns. Specify the joining columns with the by argument.
Syntax
The basic syntax of full_join() is:
full_join(x, y, by = NULL, ...)
Where:
- x and y are the two data frames to join.
- by specifies the column(s) to join on. If NULL, full_join() joins on all columns that have the same name in both data frames.
- Additional arguments can control suffixes for overlapping non-key column names.

With the default suffixes written out:

```r
full_join(x, y, by = NULL, suffix = c(".x", ".y"), ...)
```
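Because by accepts more than one column, a join on a composite key may be easier to picture with a small example. The sketch below assumes two hypothetical tables, sales and targets, that are both keyed by region and year; the names and values are made up for illustration.

```r
library(dplyr)

# Hypothetical tables keyed by both region and year
sales <- data.frame(
  region = c("N", "N", "S"),
  year   = c(2023, 2024, 2023),
  sales  = c(10, 12, 8)
)
targets <- data.frame(
  region = c("N", "S", "S"),
  year   = c(2024, 2023, 2024),
  target = c(11, 9, 10)
)

# A row matches only when region AND year are both equal
combined <- full_join(sales, targets, by = c("region", "year"))
print(combined)
```

Here (N, 2023) exists only in sales and (S, 2024) only in targets, so each of those rows carries an NA in the column that comes from the other table.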
Example
This example shows how to join two data frames by a common column id. The result keeps all rows from both data frames, filling missing values with NA.
```r
library(dplyr)

# Create first data frame
df1 <- data.frame(id = c(1, 2, 3), value1 = c("A", "B", "C"))

# Create second data frame
df2 <- data.frame(id = c(2, 3, 4), value2 = c("X", "Y", "Z"))

# Perform full join by 'id'
result <- full_join(df1, df2, by = "id")
print(result)
```
Output
```
  id value1 value2
1  1      A   <NA>
2  2      B      X
3  3      C      Y
4  4   <NA>      Z
```
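To see how this differs from the other joins, it can help to count the rows each one returns for the same inputs. This is a minimal sketch that rebuilds the two data frames from the example; the variable names n_inner, n_left, and n_full are only illustrative.

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 3), value1 = c("A", "B", "C"))
df2 <- data.frame(id = c(2, 3, 4), value2 = c("X", "Y", "Z"))

n_inner <- nrow(inner_join(df1, df2, by = "id"))  # only ids present in both: 2, 3
n_left  <- nrow(left_join(df1, df2, by = "id"))   # every id from df1: 1, 2, 3
n_full  <- nrow(full_join(df1, df2, by = "id"))   # every id from either: 1, 2, 3, 4

print(c(inner = n_inner, left = n_left, full = n_full))
```

The full join is the only one of the three that keeps id 4, which exists only in df2.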
Common Pitfalls
Common mistakes when using full_join() include:
- Not specifying the by argument when the key columns have different names in each data frame.
- Unexpected duplicate rows if the join keys are not unique.
- Confusing full_join() with inner_join() or left_join(), which keep fewer rows.
Example of specifying different key names:
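```r
library(dplyr)

df1 <- data.frame(key1 = c(1, 2), val1 = c("A", "B"))
df2 <- data.frame(key2 = c(2, 3), val2 = c("X", "Y"))

# Wrong: no 'by' specified. The data frames share no column names,
# so full_join() stops with an error instead of returning a result.
# try() lets the script continue past the error.
wrong_join <- try(full_join(df1, df2), silent = TRUE)

# Right: map the differently named keys with a named vector
correct_join <- full_join(df1, df2, by = c("key1" = "key2"))

print(correct_join)
```
Output
The first call fails because by must be supplied when x and y have no common variables, so wrong_join holds an error object rather than a data frame. The second call returns one row per key, with the key values from both tables combined in key1:
```
  key1 val1 val2
1    1    A <NA>
2    2    B    X
3    3 <NA>    Y
```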
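The duplicate-key pitfall is easier to recognize with a concrete case. The sketch below assumes two hypothetical tables, customers and orders, where id 1 appears twice in orders; it checks for repeated keys with count() and filter() before joining.

```r
library(dplyr)

# Hypothetical data: each customer id is unique, but id 1 has two orders
customers <- data.frame(id = c(1, 2), name = c("Ann", "Bob"))
orders    <- data.frame(id = c(1, 1, 3), item = c("pen", "ink", "pad"))

# List any key that occurs more than once in a table
dup_customers <- customers %>% count(id) %>% filter(n > 1)
dup_orders    <- orders %>% count(id) %>% filter(n > 1)
print(dup_orders)

# The customer row for id 1 is repeated once per matching order
joined <- full_join(customers, orders, by = "id")
print(joined)
```

The result has four rows even though there are only three distinct ids, because the customer with id 1 is paired with each of its two orders.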
Quick Reference
| Argument | Description |
|---|---|
| x | First data frame to join |
| y | Second data frame to join |
| by | Column name(s) to join on; NULL joins on all shared column names, and a named vector such as c("key1" = "key2") maps keys with different names |
| suffix | Suffixes added to overlapping non-key columns (default: .x, .y) |
| ... | Additional arguments passed to methods |
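The suffix argument only matters when both data frames carry a non-key column with the same name. As a minimal sketch, assume two hypothetical tables, math and art, that each have a score column; the suffix strings "_math" and "_art" are arbitrary choices for this example.

```r
library(dplyr)

# Hypothetical tables: both have a non-key column named "score"
math <- data.frame(id = c(1, 2), score = c(90, 85))
art  <- data.frame(id = c(2, 3), score = c(70, 95))

# With the defaults the columns would be named score.x and score.y;
# custom suffixes make it clearer which table each value came from
scores <- full_join(math, art, by = "id", suffix = c("_math", "_art"))
print(scores)
```

The key column id is not renamed; only the clashing score columns receive a suffix.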
Key Takeaways
- Use full_join() to keep all rows from both data frames, matching by key columns.
- Always specify the by argument when join keys have different names in each data frame.
- Rows without matches get NA in the columns from the other data frame.
- full_join() differs from inner_join() and left_join() by including unmatched rows from both sides.
- Check for duplicate keys to avoid unexpected row duplication in the result.