
How to Join Datasets in Python: Simple Guide with Examples

To join datasets in Python, use the pandas library's merge() function to combine dataframes based on common columns or indexes. You can specify the type of join like inner, left, right, or outer to control how rows are matched and combined.

Syntax

The basic syntax to join two datasets (dataframes) in Python using pandas is:

  • pd.merge(left, right, how='inner', on='key_column')

Where:

  • left and right are the two dataframes to join.
  • how specifies the type of join: 'inner' (default), 'left', 'right', or 'outer'.
  • on is the column name(s) to join on.
python
import pandas as pd

# Syntax pattern
df_merged = pd.merge(left_df, right_df, how='inner', on='key_column')
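
Because on accepts a list as well as a single name, the same pattern works when a row is identified by more than one column. The sketch below is illustrative only: the sales and targets dataframes and their values are made up for this example and are not part of the guide's data.

```python
import pandas as pd

# Hypothetical data: each row is identified by the pair (year, region)
sales = pd.DataFrame({
    'year': [2023, 2023, 2024],
    'region': ['EU', 'US', 'EU'],
    'revenue': [100, 150, 120],
})
targets = pd.DataFrame({
    'year': [2023, 2024, 2024],
    'region': ['EU', 'EU', 'US'],
    'target': [90, 110, 160],
})

# Pass a list to 'on' so rows match only when BOTH columns agree
merged = pd.merge(sales, targets, how='inner', on=['year', 'region'])
print(merged)
```

Only the (2023, EU) and (2024, EU) pairs appear in both dataframes, so the inner join keeps those two rows.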

Example

This example joins two datasets on a common column called id. It uses an inner join, which keeps only the rows whose id appears in both dataframes.

python
import pandas as pd

# Create first dataset
data1 = {'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']}
df1 = pd.DataFrame(data1)

# Create second dataset
data2 = {'id': [2, 3, 4], 'age': [25, 30, 22]}
df2 = pd.DataFrame(data2)

# Join datasets on 'id' with inner join
joined_df = pd.merge(df1, df2, how='inner', on='id')

print(joined_df)
Output
   id     name  age
0   2      Bob   25
1   3  Charlie   30
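
To see how the how parameter changes the result, the same two dataframes can be joined with 'left', 'right', and 'outer'. This is a minimal sketch that reuses the example data; the variable names left_df, right_df, and outer_df are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'age': [25, 30, 22]})

# Left join: keep every row of df1; 'age' is NaN where df2 has no match (id 1)
left_df = pd.merge(df1, df2, how='left', on='id')

# Right join: keep every row of df2; 'name' is NaN where df1 has no match (id 4)
right_df = pd.merge(df1, df2, how='right', on='id')

# Outer join: keep every id found in either dataframe
outer_df = pd.merge(df1, df2, how='outer', on='id')

print(left_df)
print(right_df)
print(outer_df)
```

Rows without a partner get NaN in the columns that come from the other dataframe, which also turns an integer column such as age into floats.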

Common Pitfalls

Common mistakes when joining datasets include:

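  • Omitting the on parameter, so pandas joins on every column name the two dataframes share, which may not be the key you intended.
  • Using the wrong how type and losing data unintentionally (for example, an inner join drops every row without a match).
  • Joining on columns with different data types, such as integers on one side and strings on the other, which raises an error or prevents keys from matching.
  • Joining on the index without passing left_index=True and right_index=True, or without first calling reset_index() to turn the index into a regular column.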

Always check your key columns and join type before merging.
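
The data type pitfall can be illustrated with a short sketch. The orders and customers dataframes are hypothetical, and the exact behavior of a mismatched merge depends on the pandas version; recent versions are expected to raise a ValueError when an integer key meets a string key, which is why the call is wrapped in try/except here.

```python
import pandas as pd

# Hypothetical data: 'id' is an integer on one side and a string on the other
orders = pd.DataFrame({'id': [1, 2, 3], 'total': [10.0, 20.0, 30.0]})
customers = pd.DataFrame({'id': ['1', '2', '4'], 'name': ['Alice', 'Bob', 'Dana']})

# Inspect the key dtypes before merging
print(orders['id'].dtype, customers['id'].dtype)

try:
    pd.merge(orders, customers, on='id')
except ValueError as exc:
    # Assumption: this branch runs on recent pandas versions
    print('merge failed:', exc)

# Fix: convert the string keys to integers so both sides share a dtype
customers['id'] = customers['id'].astype('int64')
fixed = pd.merge(orders, customers, on='id', how='inner')
print(fixed)
```

Casting one key column so both sides share a dtype lets ids 1 and 2 match as expected.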

python
import pandas as pd

# Wrong: joining on different column names without specifying 'left_on' and 'right_on'
data1 = {'id1': [1, 2], 'val': ['A', 'B']}
data2 = {'id2': [2, 3], 'val': ['C', 'D']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# This raises a KeyError because df2 has no column named 'id1'
# wrong_join = pd.merge(df1, df2, on='id1')  # KeyError: 'id1'

# Correct way specifying different column names
correct_join = pd.merge(df1, df2, left_on='id1', right_on='id2', how='inner')
print(correct_join)
Output
   id1 val_x  id2 val_y
0    2     B    2     C
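
The index pitfall listed above has a similar fix. When the keys live in the index instead of a column, merge() accepts left_index and right_index. The following is a sketch with made-up dataframes (names, ages, people) that shows joining index to index and joining a column to an index.

```python
import pandas as pd

# Hypothetical data: the ids are stored in the index, not in a column
names = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie']}, index=[1, 2, 3])
ages = pd.DataFrame({'age': [25, 30, 22]}, index=[2, 3, 4])

# Index-to-index join
by_index = pd.merge(names, ages, left_index=True, right_index=True, how='inner')
print(by_index)

# Column-to-index join: move the left index into a column named 'id' first
people = names.reset_index().rename(columns={'index': 'id'})
mixed = pd.merge(people, ages, left_on='id', right_index=True, how='inner')
print(mixed)
```

Both calls keep only ids 2 and 3, the values present on both sides.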

Quick Reference

| Parameter | Description | Example Values |
| --- | --- | --- |
| left | Left dataframe to join | df1 |
| right | Right dataframe to join | df2 |
| how | Type of join | 'inner', 'left', 'right', 'outer' |
| on | Column(s) to join on | 'id' |
| left_on | Left dataframe join column | 'id1' |
| right_on | Right dataframe join column | 'id2' |

Key Takeaways

Use pandas merge() to join datasets by common columns or indexes.
Choose the join type (inner, left, right, outer) to control which rows appear.
Name the key columns explicitly with on, or with left_on and right_on when the column names differ.
Check that join columns have matching data types before merging to avoid errors or missed matches.
Test joins on small data samples to verify results before applying to large datasets.
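
One way to test a join on a small sample is to let merge() report where each row came from and check that the keys are unique. The sketch below reuses the example dataframes and relies on the indicator and validate parameters of pd.merge; the dup dataframe is a hypothetical input added here to trigger the check.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'age': [25, 30, 22]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
# validate='one_to_one' raises if a key repeats on either side
checked = pd.merge(df1, df2, on='id', how='outer',
                   indicator=True, validate='one_to_one')
counts = checked['_merge'].value_counts()
print(counts)

# Hypothetical input with a duplicated key to show the validation failing
dup = pd.DataFrame({'id': [2, 2], 'age': [25, 26]})
try:
    pd.merge(df1, dup, on='id', validate='one_to_one')
    duplicate_detected = False
except pd.errors.MergeError as exc:
    duplicate_detected = True
    print('validation failed:', exc)
```

The _merge counts show how many rows matched and how many came from one side only, which makes accidental data loss easier to spot before running the join on a large dataset.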