Given two pandas DataFrames df1 and df2:
df1 = pd.DataFrame({"key": [1, 2, 3], "val1": ["A", "B", "C"]})
df2 = pd.DataFrame({"key": [2, 3, 4], "val2": ["X", "Y", "Z"]})
result = df1.merge(df2, on="key", how="left")
print(result)What will be printed?
import pandas as pd df1 = pd.DataFrame({"key": [1, 2, 3], "val1": ["A", "B", "C"]}) df2 = pd.DataFrame({"key": [2, 3, 4], "val2": ["X", "Y", "Z"]}) result = df1.merge(df2, on="key", how="left") print(result)
Remember, a left join keeps all rows from the left DataFrame and matches rows from the right DataFrame where keys are equal. If no match, NaN appears.
The left join keeps all keys from df1. For key=1, no match in df2, so val2 is NaN. For keys 2 and 3, matching values from df2 are joined.
When performing a left join between two tables, what happens to rows in the left table that have no matching key in the right table?
Think about what a left join means: all rows from the left table must appear.
In a left join, all rows from the left table appear. If no matching key exists in the right table, the columns from the right table are filled with NULL (or NaN in pandas).
You have two DataFrames:
df1 = pd.DataFrame({"id": [1, 2], "val1": ["A", "B"]})
df2 = pd.DataFrame({"key": [2, 3], "val2": ["X", "Y"]})Which code correctly performs a left join where df1.id matches df2.key?
When column names differ, you must specify which columns to join on from each DataFrame.
Option B uses left_on and right_on to specify the join columns with different names. Option B tries to join on both columns simultaneously, which is invalid. Option B tries to join on a column that doesn't exist in df2. Option B joins on index, which is not intended here.
You have two very large DataFrames and want to perform a left join on a key column. Which approach can improve performance?
Indexes can speed up lookups during joins.
Setting the key columns as indexes allows pandas to use faster join algorithms. Manual loops are slow. concat does not join on keys. Dropping columns may reduce memory but does not optimize join itself.
Given:
df1 = pd.DataFrame({"key": [1, 2, 2], "val1": ["A", "B", "C"]})
df2 = pd.DataFrame({"key": [2], "val2": ["X"]})
result = df1.merge(df2, on="key", how="left")
print(result)The output is:
key val1 val2 0 1 A NaN 1 2 B X 2 2 C X
Why are there two rows with key=2 in the result?
Think about how joins behave when the left table has repeated keys.
When the left DataFrame has duplicate keys, each duplicate row is matched with the corresponding rows in the right DataFrame. Since df2 has one row with key=2, both rows in df1 with key=2 get matched, resulting in two rows in the output.