How to Merge Datasets in Python: Simple Guide with Examples
You can merge datasets in Python using the pandas library's merge() function, which combines data based on common columns or indexes. It works like combining two lists by hand: rows that share the same key value are matched and joined into one.
Syntax
The basic syntax for merging datasets with pandas is:
```python
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None)
```
Here:
- left and right are the two datasets (DataFrames) to merge.
- how defines the type of merge: 'inner' (default), 'left', 'right', or 'outer'.
- on is the column name(s) to join on when both datasets use the same column name.
- left_on and right_on specify the columns to join on when the column names differ.
```python
import pandas as pd

# left_df and right_df stand for two existing DataFrames that share 'key_column'
merged_df = pd.merge(left_df, right_df, how='inner', on='key_column')
```
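Since on accepts more than one column name, a short sketch of a multi-column key may help. The sales and targets tables below are hypothetical data made up for illustration; a row is kept only when every listed column matches.

```python
import pandas as pd

# Hypothetical tables keyed by the pair (region, year)
sales = pd.DataFrame({
    'region': ['East', 'East', 'West'],
    'year': [2023, 2024, 2024],
    'sales': [100, 120, 90],
})
targets = pd.DataFrame({
    'region': ['East', 'West', 'West'],
    'year': [2024, 2023, 2024],
    'target': [110, 80, 95],
})

# Passing a list to 'on' requires both 'region' and 'year' to agree
merged_multi = pd.merge(sales, targets, how='inner', on=['region', 'year'])
print(merged_multi)
```

Only the (East, 2024) and (West, 2024) pairs appear in both tables, so the result has two rows.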
Example
This example shows how to merge two datasets by a common column called id. It combines user names with their city information.
```python
import pandas as pd

# First dataset with user ids and names
users = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

# Second dataset with user ids and cities
cities = pd.DataFrame({
    'id': [2, 3, 4],
    'city': ['New York', 'Los Angeles', 'Chicago']
})

# Merge on 'id' with inner join (only matching ids)
merged = pd.merge(users, cities, how='inner', on='id')
print(merged)
```
Output
```
   id     name         city
0   2      Bob     New York
1   3  Charlie  Los Angeles
```
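To see what the other how values do with the same data, the sketch below repeats the merge once per join type and records which ids end up in each result. The ids_by_how dictionary is just a helper name chosen for this illustration.

```python
import pandas as pd

users = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
cities = pd.DataFrame({'id': [2, 3, 4], 'city': ['New York', 'Los Angeles', 'Chicago']})

# Run the same merge with each join type and collect the surviving ids
ids_by_how = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(users, cities, how=how, on='id')
    ids_by_how[how] = sorted(result['id'].tolist())
    print(how, ids_by_how[how])
```

An inner join keeps ids 2 and 3, a left join keeps every id from users (1, 2, 3), a right join keeps every id from cities (2, 3, 4), and an outer join keeps all four. Rows without a match on the other side get NaN in the missing columns.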
Common Pitfalls
Common mistakes when merging datasets include:
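- Not specifying left_on and right_on when the key columns have different names. pandas then either raises a MergeError (if the datasets share no column names) or silently joins on whatever other columns happen to share a name.
- Using the wrong how type, which can drop rows unintentionally.
- Non-key columns with the same name in both datasets, which pandas renames with _x and _y suffixes in the result.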
Always check your column names and choose the right how option for your needs.
```python
import pandas as pd

left = pd.DataFrame({'id_left': [1, 2], 'value': ['A', 'B']})
right = pd.DataFrame({'id_right': [2, 3], 'value': ['C', 'D']})

# Wrong: the key columns have different names but no join columns are given.
# pandas falls back to the only shared column, 'value', finds no matching
# values, and returns an empty result instead of raising an error.
# merged_wrong = pd.merge(left, right)

# Correct: name the key column on each side
merged_right = pd.merge(left, right, left_on='id_left', right_on='id_right', how='inner')
print(merged_right)
```
Output
```
   id_left value_x  id_right value_y
0        2       B         2       C
```
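Two optional merge() parameters that are not covered above can help with the other pitfalls: suffixes replaces the default _x and _y names, and indicator=True adds a _merge column showing where each row came from. The sketch below is one way to use them for checking a merge; the suffix strings are arbitrary choices for illustration.

```python
import pandas as pd

left = pd.DataFrame({'id_left': [1, 2], 'value': ['A', 'B']})
right = pd.DataFrame({'id_right': [2, 3], 'value': ['C', 'D']})

checked = pd.merge(
    left, right,
    left_on='id_left', right_on='id_right',
    how='outer',                             # keep unmatched rows so they can be inspected
    suffixes=('_from_left', '_from_right'),  # replaces the default _x / _y
    indicator=True,                          # adds a '_merge' column
)
print(checked[['id_left', 'id_right', '_merge']])
```

Each row's _merge value is 'left_only', 'right_only', or 'both', which makes it easy to see which rows a stricter how setting would have dropped.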
Quick Reference
| Parameter | Description | Example Values |
|---|---|---|
| left | First DataFrame to merge | df1 |
| right | Second DataFrame to merge | df2 |
| how | Type of merge | 'inner', 'left', 'right', 'outer' |
| on | Column(s) to join on (same name) | 'id' |
| left_on | Column(s) in left DataFrame to join on | 'id_left' |
| right_on | Column(s) in right DataFrame to join on | 'id_right' |
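The table above lists only column-based parameters. For merging on indexes, merge() also accepts left_index and right_index. The following is a minimal sketch, assuming hypothetical tables that store the user id as the index instead of a column.

```python
import pandas as pd

# Hypothetical tables that use the user id as the index rather than a column
names = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie']}, index=[1, 2, 3])
cities = pd.DataFrame({'city': ['New York', 'Los Angeles', 'Chicago']}, index=[2, 3, 4])

# Match rows by index label on both sides
by_index = pd.merge(names, cities, how='inner', left_index=True, right_index=True)
print(by_index)
```

Only index labels 2 and 3 exist in both tables, so the inner join returns those two rows.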
Key Takeaways
Use pandas.merge() to combine datasets by matching columns or indexes.
Specify the correct join type with the 'how' parameter to control which data is kept.
Always check column names and use 'on', 'left_on', and 'right_on' correctly to avoid errors.
Inner join keeps only matching rows; outer join keeps all rows from both datasets.
Merging datasets is like joining two lists by a common key to get combined information.