
How to Merge Datasets in Python: Simple Guide with Examples

You can merge datasets in Python with the pandas merge() function, which combines two DataFrames based on common columns or indexes. It works the same way as combining two lists by hand: rows are paired wherever their key values match.

Syntax

The basic syntax for merging datasets with pandas is:

pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None)

Here:

  • left and right are the two datasets (DataFrames) to merge.
  • how defines the type of merge: 'inner' (default), 'left', 'right', or 'outer'.
  • on is the column name(s) to join on when both datasets use the same name for the key column. If on is omitted, pandas joins on every column name the two datasets share.
  • left_on and right_on specify columns to join on if the column names differ.
python
import pandas as pd

merged_df = pd.merge(left_df, right_df, how='inner', on='key_column')
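
To make the effect of how concrete, here is a minimal sketch that runs the same merge with each of the four join types. The sample data and the left_val and right_val column names are hypothetical, chosen only so that some keys match and some do not.

```python
import pandas as pd

# Hypothetical sample data: keys 2 and 3 appear in both DataFrames,
# key 1 only on the left, key 4 only on the right
left_df = pd.DataFrame({'key_column': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right_df = pd.DataFrame({'key_column': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# Run the same merge with each join type and record which keys survive
keys_by_how = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(left_df, right_df, how=how, on='key_column')
    keys_by_how[how] = result['key_column'].tolist()
    print(how, keys_by_how[how])
```

With this data, 'inner' keeps keys 2 and 3, 'left' keeps 1 to 3, 'right' keeps 2 to 4, and 'outer' keeps all four. Columns from the side with no matching row are filled with NaN.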

Example

This example shows how to merge two datasets by a common column called id. It combines user names with their city information.

python
import pandas as pd

# First dataset with user ids and names
users = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

# Second dataset with user ids and cities
cities = pd.DataFrame({
    'id': [2, 3, 4],
    'city': ['New York', 'Los Angeles', 'Chicago']
})

# Merge on 'id' with inner join (only matching ids)
merged = pd.merge(users, cities, how='inner', on='id')
print(merged)
Output
   id     name         city
0   2      Bob     New York
1   3  Charlie  Los Angeles
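
The inner join above silently leaves out ids 1 and 4. One way to see which rows fail to match, sketched here with the same users and cities data, is an outer join with indicator=True, which adds a _merge column. The audit and unmatched variable names are illustrative only.

```python
import pandas as pd

users = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
cities = pd.DataFrame({'id': [2, 3, 4], 'city': ['New York', 'Los Angeles', 'Chicago']})

# Outer join keeps every id; indicator=True adds a '_merge' column
# whose value is 'both', 'left_only', or 'right_only'
audit = pd.merge(users, cities, how='outer', on='id', indicator=True)

# Rows that an inner join would have dropped
unmatched = audit[audit['_merge'] != 'both']
print(unmatched[['id', '_merge']])
```

Here id 1 is marked left_only (a user with no city) and id 4 is marked right_only (a city with no user).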

Common Pitfalls

Common mistakes when merging datasets include:

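  • Not specifying left_on and right_on when the key columns have different names. pandas then raises an error if the datasets share no column names, or silently joins on whichever columns they do share.
  • Using the wrong how type, which can exclude data unintentionally.
  • Non-key columns with the same name in both datasets, which pandas renames with _x and _y suffixes in the result.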

Always check your column names and choose the right how option for your needs.

python
import pandas as pd

# Pitfall: the key columns have different names and no join columns are given
left = pd.DataFrame({'id_left': [1, 2], 'value': ['A', 'B']})
right = pd.DataFrame({'id_right': [2, 3], 'value': ['C', 'D']})

# With no keys given, pandas joins on the shared 'value' column, so this returns
# an empty result here (it would raise MergeError if no column names were shared)
# merged_wrong = pd.merge(left, right)

# Correct way specifying columns to join on
merged_right = pd.merge(left, right, left_on='id_left', right_on='id_right', how='inner')
print(merged_right)
Output
   id_left value_x  id_right value_y
0        2       B         2       C
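
The value_x and value_y names in the output come from pandas' default suffixes for overlapping non-key columns. As a small sketch, the suffixes parameter of merge() can replace them with clearer labels. The '_left' and '_right' labels and the labeled variable name are arbitrary choices for illustration.

```python
import pandas as pd

left = pd.DataFrame({'id_left': [1, 2], 'value': ['A', 'B']})
right = pd.DataFrame({'id_right': [2, 3], 'value': ['C', 'D']})

# suffixes applies only to column names that appear in both DataFrames,
# so 'id_left' and 'id_right' are left unchanged
labeled = pd.merge(
    left, right,
    left_on='id_left', right_on='id_right',
    how='inner',
    suffixes=('_left', '_right'),
)
print(labeled.columns.tolist())
```

The overlapping value column becomes value_left and value_right, which makes it easier to tell which dataset each value came from.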

Quick Reference

| Parameter | Description | Example Values |
| --- | --- | --- |
| left | First DataFrame to merge | df1 |
| right | Second DataFrame to merge | df2 |
| how | Type of merge | 'inner', 'left', 'right', 'outer' |
| on | Column(s) to join on (same name) | 'id' |
| left_on | Column(s) in left DataFrame to join on | 'id_left' |
| right_on | Column(s) in right DataFrame to join on | 'id_right' |
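
The parameters listed here cover column keys. For joining on indexes, merge() also accepts left_index and right_index. The sketch below assumes the ids from the earlier example are stored as the index rather than as a column. The by_index variable name is illustrative.

```python
import pandas as pd

# Same data as the earlier example, but with the ids stored as the index
users = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie']}, index=[1, 2, 3])
cities = pd.DataFrame({'city': ['New York', 'Los Angeles', 'Chicago']}, index=[2, 3, 4])

# left_index=True and right_index=True match rows by index label
by_index = pd.merge(users, cities, how='inner', left_index=True, right_index=True)
print(by_index)
```

Only index labels 2 and 3 appear in both DataFrames, so the result contains Bob and Charlie with their cities.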

Key Takeaways

Use pandas.merge() to combine datasets by matching columns or indexes.
Specify the correct join type with the 'how' parameter to control which data is kept.
Always check column names and use 'on', 'left_on', and 'right_on' correctly to avoid errors.
Inner join keeps only matching rows; outer join keeps all rows from both datasets.
Merging datasets is like joining two lists by a common key to get combined information.