What is the output DataFrame after running the code below?
import pandas as pd

data = {'A': [1, 2, 2, 3, 4], 'B': [5, 6, 6, 7, 8], 'C': [9, 10, 11, 12, 13]}
df = pd.DataFrame(data)
result = df.drop_duplicates(subset=['A', 'B'])
print(result)
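By default, drop_duplicates keeps the first occurrence of each duplicate, judged only on the subset columns.
The method drop_duplicates with subset=['A', 'B'] removes rows where the combination of columns A and B repeats. Column C is ignored in the comparison, so the rows at index 1 and index 2 count as duplicates even though their C values differ (10 and 11). The first occurrence is kept, so the row at index 1 (A=2, B=6, C=10) stays and the row at index 2 is dropped. The result keeps the original index labels 0, 1, 3, 4.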
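As a quick check of the explanation above, the following sketch reruns the question's code and inspects which index labels and C values survive. The variable names kept_index and kept_c are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4],
    'B': [5, 6, 6, 7, 8],
    'C': [9, 10, 11, 12, 13],
})

# Only A and B are compared; C plays no part in deciding what is a duplicate.
result = df.drop_duplicates(subset=['A', 'B'])

kept_index = result.index.tolist()  # original labels are preserved, not renumbered
kept_c = result['C'].tolist()       # C=11 belonged to the dropped row at index 2

print(kept_index)  # [0, 1, 3, 4]
print(kept_c)      # [9, 10, 12, 13]
```

If a fresh 0..n-1 index is wanted afterwards, chaining reset_index(drop=True) is one common option.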
What is the number of duplicate rows based on columns 'X' and 'Y' in the DataFrame?
import pandas as pd

data = {'X': [1, 2, 2, 3, 3, 3], 'Y': ['a', 'b', 'b', 'c', 'c', 'd'], 'Z': [10, 20, 20, 30, 30, 40]}
df = pd.DataFrame(data)
duplicates = df.duplicated(subset=['X', 'Y'])
count = duplicates.sum()
print(count)
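By default (keep='first'), duplicated() marks every occurrence except the first as a duplicate.
The pairs (X=2, Y='b') and (X=3, Y='c') each appear twice. The second occurrence of each pair is marked True (index 2 and index 4), while the row (X=3, Y='d') is unique. Summing the boolean Series counts the True values, so the code prints 2.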
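To make the count concrete, this sketch prints the boolean mask itself before summing it. It reuses the question's data; flags is a name added here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'X': [1, 2, 2, 3, 3, 3],
    'Y': ['a', 'b', 'b', 'c', 'c', 'd'],
    'Z': [10, 20, 20, 30, 30, 40],
})

# One boolean per row: True means "an earlier row has the same (X, Y) pair".
mask = df.duplicated(subset=['X', 'Y'])
flags = mask.tolist()

# True counts as 1 when summed, so this is the number of flagged rows.
count = int(mask.sum())

print(flags)  # [False, False, True, False, True, False]
print(count)  # 2
```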
What error will the following code produce?
import pandas as pd

data = {'A': [1, 2, 2], 'B': [3, 4, 4]}
df = pd.DataFrame(data)
filtered = df[df.duplicated(columns=['A'])]
print(filtered)
Check the parameter name for specifying columns in duplicated().
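The duplicated() method takes the columns to compare through the 'subset' parameter, not 'columns'. Because duplicated() has no 'columns' parameter, the call raises a TypeError for an unexpected keyword argument before any filtering happens.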
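The sketch below reproduces the failure and then shows the call with subset instead. The try/except wrapper and the name error_name are additions for demonstration, not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2], 'B': [3, 4, 4]})

error_name = None
try:
    # 'columns' is not a parameter of DataFrame.duplicated
    df.duplicated(columns=['A'])
except TypeError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)

# Corrected call: pass the columns through 'subset'.
filtered = df[df.duplicated(subset=['A'])]
print(filtered)  # only the row at index 2 (A=2, B=4) is flagged
```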
Which code snippet correctly filters the DataFrame to keep only rows that have duplicates based on columns 'M' and 'N'?
import pandas as pd

data = {'M': [1, 2, 2, 3, 3, 3], 'N': ['x', 'y', 'y', 'z', 'z', 'w'], 'O': [5, 6, 7, 8, 9, 10]}
df = pd.DataFrame(data)
Use keep=False to mark all duplicates, not just later ones.
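duplicated(subset=['M', 'N'], keep=False) marks every row of a repeated (M, N) group as True, including the first occurrence, so filtering the DataFrame with that mask keeps all rows that have duplicates. With the default keep='first' the mask selects only the later occurrences, and drop_duplicates removes duplicates instead of selecting them.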
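A minimal sketch of the difference, using the question's data. The names all_dupes and later_only are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'M': [1, 2, 2, 3, 3, 3],
    'N': ['x', 'y', 'y', 'z', 'z', 'w'],
    'O': [5, 6, 7, 8, 9, 10],
})

# keep=False flags every member of a repeated (M, N) group.
all_dupes = df[df.duplicated(subset=['M', 'N'], keep=False)]

# The default keep='first' leaves the first member of each group unflagged.
later_only = df[df.duplicated(subset=['M', 'N'])]

print(all_dupes.index.tolist())   # [1, 2, 3, 4]
print(later_only.index.tolist())  # [2, 4]
```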
Consider a DataFrame with duplicate rows based on columns 'P' and 'Q'. What is the difference in output between drop_duplicates(subset=['P', 'Q'], keep='first') and drop_duplicates(subset=['P', 'Q'], keep='last')?
Think about which duplicate row is kept when using 'first' or 'last'.
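The keep parameter controls which row of each duplicate group survives: 'first' keeps the earliest row, 'last' keeps the latest row, and the rest of the group is dropped. Rows that have no duplicates appear in both results. The two outputs therefore have the same number of rows and the same (P, Q) combinations, and differ only in the index labels and the values of any columns outside the subset.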
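The question gives no concrete data, so the sketch below assumes a small hypothetical DataFrame with an extra column R (not in the original) to make the difference visible.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 share (P, Q) but differ in R.
df = pd.DataFrame({
    'P': [1, 1, 2],
    'Q': ['a', 'a', 'b'],
    'R': [10, 20, 30],
})

first = df.drop_duplicates(subset=['P', 'Q'], keep='first')
last = df.drop_duplicates(subset=['P', 'Q'], keep='last')

print(first.index.tolist(), first['R'].tolist())  # [0, 2] [10, 30]
print(last.index.tolist(), last['R'].tolist())    # [1, 2] [20, 30]
```

Both results contain one row per (P, Q) pair; only the surviving row of the duplicated pair changes.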