Which of the following reasons best explains why data cleaning often consumes the majority of time in data analysis?
Think about what problems raw data usually has before you can analyze it.
Raw data often has errors, missing values, and inconsistencies. Fixing these issues carefully takes time, which is why data cleaning is the longest step.
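As a concrete illustration of those issues, here is a minimal sketch of typical cleaning steps on a made-up pandas DataFrame (column names and values are assumptions for the example):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age':  [25, np.nan, 30, 30, 200],       # a missing value and an implausible outlier
    'city': ['NY', 'ny', 'LA', 'LA', 'SF'],  # inconsistent casing
})

df['city'] = df['city'].str.upper()               # fix inconsistent categories
df = df.drop_duplicates()                         # remove exact duplicate rows
df['age'] = df['age'].fillna(df['age'].median())  # impute missing values
df = df[df['age'] < 120]                          # drop implausible outliers
print(df)
```

Each step is quick to write but slow to get right on real data, which is where the time goes.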
Given the following Python code that replaces missing values in a DataFrame column with the column mean, what is the resulting DataFrame?
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5]})
mean_val = df['A'].mean()
df['A'] = df['A'].fillna(mean_val)
print(df)
Calculate the mean ignoring NaN, then fill NaN with that mean.
By default, mean() skips NaN, so the mean is computed over the non-missing values: (1 + 2 + 4 + 5) / 4 = 3.0. The NaN is replaced by 3.0, giving the column [1.0, 2.0, 3.0, 4.0, 5.0].
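Running the snippet confirms this. The sketch below reproduces the steps and checks both the computed mean and the filled column:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5]})
mean_val = df['A'].mean()           # skipna=True by default: mean of [1, 2, 4, 5]
df['A'] = df['A'].fillna(mean_val)  # NaN at index 2 becomes 3.0
```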
What error will this Python code raise when trying to drop rows with missing values?
import pandas as pd
df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})
df.dropna(inplace=True, axis=2)
Check the valid axis values for dropna in a pandas DataFrame.
DataFrames only support axis=0 (rows) or axis=1 (columns); axis=2 is invalid, so pandas raises a ValueError.
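A hedged sketch of the valid usage, plus a check that the invalid axis does raise ValueError (the DataFrame is the one from the question):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})

rows_kept = df.dropna(axis=0)  # drops rows 1 and 2, which each contain a NaN
cols_kept = df.dropna(axis=1)  # drops both columns, since each contains a NaN

try:
    df.dropna(axis=2)          # invalid: no axis 2 on a DataFrame
except ValueError:
    print("axis=2 raised ValueError")
```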
You have a dataset with some extreme outlier values in a numeric column. Which method is best to reduce their impact before analysis?
Consider a method that reduces outlier effect without losing data.
Winsorization caps extreme values to reduce their effect but keeps all data points, making it a balanced approach.
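One common way to winsorize is to clip values at chosen percentiles, here sketched with NumPy on made-up data (the 5th/95th percentile bounds are an assumption for the example):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 100])        # 100 is an extreme outlier
low, high = np.percentile(data, [5, 95])     # percentile bounds for capping
winsorized = np.clip(data, low, high)        # cap values outside [low, high]
```

All six data points are kept; only the extreme value is pulled in toward the bulk of the distribution.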
You see a boxplot of a numeric column before and after cleaning. The 'before' plot shows many points outside whiskers, the 'after' plot shows fewer. What does this indicate?
Think about what fewer points outside whiskers mean in a boxplot.
Fewer points outside whiskers after cleaning means outliers were removed or capped, improving data quality.
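The points outside the whiskers correspond to the standard 1.5 × IQR rule. A minimal sketch of flagging such points with NumPy (the data values are made up for illustration):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 50])    # 50 lies far from the bulk
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # whisker bounds
outliers = data[(data < lower) | (data > upper)] # points drawn outside whiskers
```

After removing or capping such points, the redrawn boxplot shows fewer markers beyond the whiskers, which is exactly the before/after difference described above.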