What is the output of the following code that checks for duplicate rows in a DataFrame?
import pandas as pd

data = {'Name': ['Anna', 'Bob', 'Anna', 'David'], 'Age': [25, 30, 25, 40]}
df = pd.DataFrame(data)
result = df.duplicated()
print(result.tolist())
Look at how pandas marks the first occurrence of a duplicate as False and subsequent ones as True.
The duplicated() method marks the first occurrence of a row as False and any later duplicates as True. Here, the third row is a duplicate of the first.
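The keep parameter controls which occurrence is treated as the "original". A small sketch of the three settings, using the same data as above:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Anna', 'Bob', 'Anna', 'David'],
                   'Age': [25, 30, 25, 40]})

# Default keep='first': first occurrence is False, later duplicates are True
print(df.duplicated().tolist())             # [False, False, True, False]

# keep='last': the last occurrence counts as the original
print(df.duplicated(keep='last').tolist())  # [True, False, False, False]

# keep=False: every member of a duplicate group is flagged
print(df.duplicated(keep=False).tolist())   # [True, False, True, False]
```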
What does the DataFrame look like after removing duplicate rows?
import pandas as pd

data = {'City': ['NY', 'LA', 'NY', 'Chicago'], 'Population': [8000, 4000, 8000, 2700]}
df = pd.DataFrame(data)
clean_df = df.drop_duplicates()
print(clean_df.reset_index(drop=True))
Remember that drop_duplicates() keeps the first occurrence and removes later duplicates.
The duplicate row with City 'NY' and Population 8000 is removed, leaving only unique rows.
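By default drop_duplicates() compares entire rows. With the subset parameter you can restrict the comparison to chosen columns; here is a sketch with hypothetical data where the two 'NY' rows differ in Population but are still deduplicated by city:

```python
import pandas as pd

df = pd.DataFrame({'City': ['NY', 'LA', 'NY', 'Chicago'],
                   'Population': [8000, 4000, 8500, 2700]})

# subset= considers only the listed columns when deciding what counts
# as a duplicate, so the second 'NY' row is dropped even though its
# Population value differs
unique_cities = df.drop_duplicates(subset=['City']).reset_index(drop=True)
print(unique_cities)
```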
What is the main reason why duplicate data can lead to incorrect analysis results?
Think about how repeated data affects averages or totals.
Duplicates cause bias because they count the same information more than once, which can distort averages, sums, and other statistics.
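A quick illustration with hypothetical salary data, where one record was accidentally loaded twice and inflates the mean:

```python
import pandas as pd

# Hypothetical table: employee C's record appears twice
salaries = pd.DataFrame({'employee': ['A', 'B', 'C', 'C'],
                         'salary': [40000, 50000, 90000, 90000]})

# The duplicate counts C's high salary twice and biases the average upward
print(salaries['salary'].mean())                    # 67500.0

# After deduplication each employee contributes once
print(salaries.drop_duplicates()['salary'].mean())  # 60000.0
```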
What error will this code produce when trying to detect duplicates?
import pandas as pd

data = {'A': [1, 2, 2], 'B': [3, 4, 4]}
df = pd.DataFrame(data)
result = df.duplicated(axis=2)
print(result)
Check the signature of duplicated() — does it accept an axis parameter at all?
The duplicated() method takes only subset and keep parameters; it has no axis argument. Passing axis=2 therefore raises a TypeError (unexpected keyword argument 'axis').
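A small sketch confirming the failure mode by catching the exception:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2], 'B': [3, 4, 4]})

try:
    # duplicated() has no axis parameter, so this call never runs
    df.duplicated(axis=2)
except TypeError as e:
    print('TypeError:', e)

# The supported call works row-wise only: the third row repeats the second
print(df.duplicated().tolist())  # [False, False, True]
```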
How can duplicate records in a training dataset affect the performance of a machine learning model?
Consider how repeated examples influence the model's learning process.
Duplicates can cause overfitting because the model learns patterns too specifically from repeated data, reducing its ability to generalize.
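A common mitigation is to drop exact duplicates before training, so no example is implicitly weighted more than once. A minimal sketch with a hypothetical training table:

```python
import pandas as pd

# Hypothetical training data: the (2.0, 1) example appears twice
train = pd.DataFrame({'feature': [1.0, 2.0, 2.0, 3.0],
                      'label':   [0,   1,   1,   0]})

# Removing exact duplicates prevents the repeated row from being
# weighted more heavily than the other examples during fitting
train_unique = train.drop_duplicates().reset_index(drop=True)
print(len(train), '->', len(train_unique))  # 4 -> 3
```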