How to Join Datasets in Python: Simple Guide with Examples
To join datasets in Python, use the pandas library's merge() function to combine dataframes based on common columns or indexes. You can specify the type of join (inner, left, right, or outer) to control how rows are matched and combined.
Syntax
The basic syntax to join two datasets (dataframes) in Python using pandas is:
pd.merge(left, right, how='inner', on='key_column')
Where:
- left and right are the two dataframes to join.
- how specifies the type of join: 'inner' (default), 'left', 'right', or 'outer'.
- on is the column name(s) to join on.
```python
import pandas as pd

# Syntax pattern (left_df and right_df are placeholder dataframes)
df_merged = pd.merge(left_df, right_df, how='inner', on='key_column')
```
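Since on accepts one or more column names, a join can also require several columns to match at once. The following is a minimal sketch with made-up data; the sales and targets dataframes and their column names are hypothetical and only serve to illustrate passing a list to on.

```python
import pandas as pd

# Hypothetical sample data keyed by the pair (region, year)
sales = pd.DataFrame({
    "region": ["north", "north", "south"],
    "year": [2023, 2024, 2024],
    "sales": [100, 120, 90],
})
targets = pd.DataFrame({
    "region": ["north", "south", "south"],
    "year": [2024, 2023, 2024],
    "target": [110, 80, 95],
})

# A row is kept only when BOTH region and year are equal on each side
multi_key = pd.merge(sales, targets, how="inner", on=["region", "year"])
print(multi_key)
```

Only the (north, 2024) and (south, 2024) pairs appear in both dataframes, so the inner join keeps those two rows.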
Example
This example shows how to join two datasets on a common column called id. It demonstrates an inner join which keeps only matching rows.
```python
import pandas as pd

# Create first dataset
data1 = {'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']}
df1 = pd.DataFrame(data1)

# Create second dataset
data2 = {'id': [2, 3, 4], 'age': [25, 30, 22]}
df2 = pd.DataFrame(data2)

# Join datasets on 'id' with inner join
joined_df = pd.merge(df1, df2, how='inner', on='id')
print(joined_df)
```
Output
id name age
0 2 Bob 25
1 3 Charlie 30
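To see how the how parameter changes the result, the same two dataframes can be merged with each join type in turn. This is an illustrative sketch; the row_counts dictionary is a helper name introduced here, not part of pandas.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "age": [25, 30, 22]})

# Merge with every join type and record how many rows each one keeps
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    result = pd.merge(df1, df2, how=how, on="id")
    row_counts[how] = len(result)
    print(f"--- how={how!r} ---")
    print(result)

print(row_counts)
```

The inner join keeps the two shared ids, left and right each keep all three rows of their own side, and outer keeps all four distinct ids. Cells with no match on the other side are filled with NaN.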
Common Pitfalls
Common mistakes when joining datasets include:
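- Omitting or misusing the on parameter. When on is not given, pandas joins on every column name the two dataframes share, which may not be the key you intended.
- Using the wrong how type and losing data unintentionally (for example, an inner join drops every row without a match).
- Joining on columns with different data types, which can raise an error or leave rows unmatched.
- Joining on an index without saying so. Either pass left_index=True or right_index=True, or call reset_index() first so the key becomes a regular column.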
Always check your key columns and join type before merging.
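The data type and index pitfalls can be sketched as follows. The orders, customers, and ages dataframes are hypothetical examples, and casting with astype is only one possible way to align the key types.

```python
import pandas as pd

# Hypothetical data: ids are integers on one side and strings on the other
orders = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
customers = pd.DataFrame({"id": ["1", "2", "4"], "city": ["Oslo", "Lima", "Rome"]})

# Inspect the key dtypes before merging
print(orders["id"].dtype, customers["id"].dtype)

# Align the key types, then merge
customers["id"] = customers["id"].astype("int64")
fixed = pd.merge(orders, customers, how="inner", on="id")
print(fixed)

# Hypothetical data whose key lives in the index rather than in a column
ages = pd.DataFrame({"age": [25, 30]}, index=pd.Index([1, 2], name="id"))

# Join a column on the left against the index on the right
by_index = pd.merge(fixed, ages, how="left", left_on="id", right_index=True)
print(by_index)
```

Calling ages.reset_index() and merging with on='id' would be an alternative to right_index=True.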
```python
import pandas as pd

# The key columns have different names: 'id1' in df1 and 'id2' in df2
data1 = {'id1': [1, 2], 'val': ['A', 'B']}
data2 = {'id2': [2, 3], 'val': ['C', 'D']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Wrong: on='id1' raises a KeyError because df2 has no 'id1' column
# wrong_join = pd.merge(df1, df2, on='id1')

# Correct: name the key column on each side
correct_join = pd.merge(df1, df2, left_on='id1', right_on='id2', how='inner')
print(correct_join)
```
Output
id1 val_x id2 val_y
0 2 B 2 C
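Both dataframes contain a val column that is not a join key, so pandas renames the two copies with suffixes (_x and _y by default). The sketch below uses the suffixes parameter of merge() to pick more descriptive names; the suffix strings themselves are arbitrary choices for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"id1": [1, 2], "val": ["A", "B"]})
df2 = pd.DataFrame({"id2": [2, 3], "val": ["C", "D"]})

# Replace the default '_x' / '_y' suffixes on the overlapping 'val' column
labeled = pd.merge(
    df1,
    df2,
    left_on="id1",
    right_on="id2",
    how="inner",
    suffixes=("_left", "_right"),
)
print(labeled.columns.tolist())
print(labeled)
```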
Quick Reference
| Parameter | Description | Example Values |
|---|---|---|
| left | Left dataframe to join | df1 |
| right | Right dataframe to join | df2 |
| how | Type of join | 'inner', 'left', 'right', 'outer' |
| on | Column(s) to join on | 'id' |
| left_on | Left dataframe join column | 'id1' |
| right_on | Right dataframe join column | 'id2' |
Key Takeaways
Use pandas merge() to join datasets by common columns or indexes.
Choose the join type (inner, left, right, outer) to control which rows appear.
Always specify the join columns explicitly with on, or with left_on and right_on when the column names differ.
Check that join columns have matching data types to avoid errors.
Test joins on small data samples to verify results before applying to large datasets.
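One way to test a join on a small sample is to use the validate and indicator parameters of merge(). The sketch below is illustrative; the sample dataframes and the caught flag are assumptions introduced for this example.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
right = pd.DataFrame({"id": [2, 3, 4], "age": [25, 30, 22]})

# indicator=True adds a '_merge' column showing where each row came from;
# validate='one_to_one' raises an error if a key is duplicated on either side
checked = pd.merge(
    left, right, how="outer", on="id", validate="one_to_one", indicator=True
)
print(checked["_merge"].value_counts())

# A duplicated key on the right side fails the one-to-one check
dup_right = pd.DataFrame({"id": [2, 2], "age": [25, 26]})
caught = False
try:
    pd.merge(left, dup_right, on="id", validate="one_to_one")
except pd.errors.MergeError as exc:
    caught = True
    print("Validation failed:", exc)
```

Rows marked left_only or right_only in the _merge column had no partner, which makes unintended data loss easier to spot before running the join on the full dataset.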