What is the output DataFrame after running the following code?
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'y', 'z', 'z', 'z'],
    'C': [10, 20, 20, 30, 30, 30]
})
result = df.drop_duplicates(subset=['A'], keep='last')
print(result)
Remember that subset=['A'] means duplicates are checked only on column 'A'. The keep='last' parameter keeps the last occurrence.
The drop_duplicates() method removes duplicate rows based on column 'A'. Since keep='last', it keeps the last row for each unique value in 'A'. For 'A' = 2, the last row is index 2, and for 'A' = 3, the last row is index 5. The row at index 0 (A=1) is also kept as it is unique.
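To make the explanation concrete, the sketch below reruns the question's DataFrame and compares keep='last' with the default keep='first'. The variable names last_index and first_index are illustrative only and are not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'y', 'z', 'z', 'z'],
    'C': [10, 20, 20, 30, 30, 30],
})

# keep='last' retains the final row of each group of equal 'A' values.
result = df.drop_duplicates(subset=['A'], keep='last')
last_index = result.index.tolist()   # [0, 2, 5]

# For comparison, keep='first' (the default) retains the earliest row instead.
first_index = df.drop_duplicates(subset=['A'], keep='first').index.tolist()  # [0, 1, 3]

print(result)
```

Both calls leave one row per distinct value of 'A'. Only the surviving index labels differ.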
Given the DataFrame below, how many rows remain after dropping every row that has a duplicate (keep=False) based on columns 'A' and 'B'?
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'y', 'z', 'z', 'w'],
    'C': [10, 20, 20, 30, 30, 40]
})
result = df.drop_duplicates(subset=['A', 'B'], keep=False)
print(len(result))
Using keep=False removes all rows that have duplicates in the specified subset.
Rows with duplicates in columns 'A' and 'B' are removed completely. The pairs (2, 'y') and (3, 'z') appear more than once, so all those rows are removed. Remaining unique rows are (1, 'x') and (3, 'w'), so 2 rows remain.
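One way to see which rows keep=False discards is to inspect the boolean mask from DataFrame.duplicated() with the same arguments. This is a minimal sketch using the question's data. The names mask and manual are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'y', 'z', 'z', 'w'],
    'C': [10, 20, 20, 30, 30, 40],
})

# True marks every row whose ('A', 'B') pair occurs more than once.
mask = df.duplicated(subset=['A', 'B'], keep=False)
print(mask.tolist())  # [False, True, True, True, True, False]

# Filtering out the marked rows selects the same rows as drop_duplicates(keep=False).
manual = df[~mask]
result = df.drop_duplicates(subset=['A', 'B'], keep=False)
print(len(result))  # 2
```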
What error will this code raise?
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
result = df.drop_duplicates(subset='A,B')
print(result)
Check the type and format of the subset argument.
The subset parameter accepts a single column label or a list of column labels. A comma-separated string is not split: 'A,B' is treated as one label, and because no column named 'A,B' exists, pandas raises a KeyError. To check both columns, pass subset=['A', 'B'].
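The following sketch reproduces the error and shows two valid forms of subset. The names raised, fixed, and single are hypothetical and used only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# A comma-separated string is looked up as a single column label.
try:
    df.drop_duplicates(subset='A,B')
    raised = False
except KeyError as exc:
    raised = True
    print('KeyError:', exc)

# A list of labels checks duplicates across both columns.
fixed = df.drop_duplicates(subset=['A', 'B'])

# A plain string is also valid when it names one existing column.
single = df.drop_duplicates(subset='A')
```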
You have a DataFrame with sales data. You want to keep only the first sale per customer. Which code snippet achieves this?
import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, 102, 101, 103, 102],
    'sale_amount': [200, 150, 300, 400, 100],
    'date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']
})
Think about which column identifies customers and which occurrence to keep.
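To keep the first sale per customer, drop duplicates based on 'customer_id' and keep the first occurrence: df.drop_duplicates(subset=['customer_id'], keep='first'). This works here because the rows are already ordered by date. Option C does exactly this.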
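A minimal sketch of this approach on the question's data follows. The explicit date conversion and sort are assumptions added here to guard against unsorted input. They are not part of the original options, and the name first_sales is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, 102, 101, 103, 102],
    'sale_amount': [200, 150, 300, 400, 100],
    'date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'],
})

# Assumption: sort by date first so that "first occurrence" means "earliest sale".
df['date'] = pd.to_datetime(df['date'])
first_sales = (
    df.sort_values('date')
      .drop_duplicates(subset=['customer_id'], keep='first')
)

print(first_sales)
```

Customers 101 and 102 each appear twice in the data, and only their earlier sales (200 and 150) are retained.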
After using drop_duplicates() on a DataFrame, what happens to the index of the resulting DataFrame?
Consider what happens to row labels when rows are removed but no explicit index reset is done.
drop_duplicates() removes rows but keeps the original index labels of the rows that remain. The resulting DataFrame may therefore have a non-contiguous index, with gaps where rows were dropped. To get a contiguous 0-based index, call reset_index(drop=True) on the result, or pass ignore_index=True to drop_duplicates() (available in pandas 1.0 and later).
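The index behavior can be checked with a small example. The DataFrame below is made up for illustration, and ignore_index=True assumes pandas 1.0 or later.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3]})

# The surviving rows keep their original labels, leaving gaps.
deduped = df.drop_duplicates()
print(deduped.index.tolist())  # [0, 1, 3]

# Option 1: renumber afterwards, discarding the old labels.
reset = deduped.reset_index(drop=True)
print(reset.index.tolist())    # [0, 1, 2]

# Option 2: renumber as part of the call (assumes pandas >= 1.0).
ignored = df.drop_duplicates(ignore_index=True)
print(ignored.index.tolist())  # [0, 1, 2]
```

Without drop=True, reset_index() would keep the old labels as a new column named 'index'.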