Which of the following is the main reason to use binning on continuous variables in machine learning?
Think about how grouping values can help handle extreme values.
Binning groups continuous values into intervals, which reduces the impact of extreme values (outliers) and can simplify the model.
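A minimal sketch (with made-up values) of how `pd.cut` caps an outlier's influence by placing it in the same top bin as any other high value:

```python
import pandas as pd

# One extreme value (1000) among otherwise small measurements.
values = [2, 3, 5, 7, 1000]

# Any value above 10 lands in the same top interval, so the
# outlier's magnitude no longer matters to downstream models.
bins = [0, 5, 10, float("inf")]
labels = pd.cut(values, bins, labels=["low", "mid", "high"])

print(labels.tolist())  # → ['low', 'low', 'low', 'mid', 'high']
```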
What is the output of the following Python code?
import pandas as pd

values = [1, 5, 10, 15, 20]
bins = [0, 5, 10, 15, 20]
categories = pd.cut(values, bins)
print(categories.tolist())
Check which bin each value falls into based on the intervals.
Values are assigned to bins whose right edge is included, since pd.cut intervals are closed on the right by default. 20 falls into the last bin, (15, 20].
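To check the boundary behavior yourself, this sketch contrasts the default right-closed intervals with right=False (left-closed):

```python
import pandas as pd

values = [1, 5, 10, 15, 20]
bins = [0, 5, 10, 15, 20]

# Default: intervals are (0, 5], (5, 10], ... so 5 joins the first bin
# and 20 fits the last bin (15, 20].
right_closed = pd.cut(values, bins)

# right=False flips to [0, 5), [5, 10), ... so 20 matches no bin
# and becomes NaN.
left_closed = pd.cut(values, bins, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```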
You have a highly skewed continuous feature. Which binning method is best to preserve information for a decision tree model?
Consider how to balance data distribution across bins.
Equal-frequency (quantile) binning ensures each bin contains a similar number of samples, which balances representation across the range of a skewed feature.
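A sketch on synthetic skewed data: equal-width bins pile nearly everything into the first interval, while quantile bins via `pd.qcut` spread the samples more evenly:

```python
import pandas as pd

# Skewed sample: most values are small, a few are very large.
values = pd.Series([1, 1, 2, 2, 3, 3, 4, 5, 50, 500])

# Equal-width bins: 9 of 10 values land in the first bin.
equal_width = pd.cut(values, bins=4)

# Equal-frequency bins: each quantile bin gets a comparable count.
equal_freq = pd.qcut(values, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```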
After binning a continuous variable into 4 bins, you train a logistic regression model. Which metric is most appropriate to check if binning improved model performance?
Think about how to measure model quality on unseen data.
Comparing accuracy on held-out validation data, with and without binning, shows whether binning helped the model generalize better.
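A sketch of the workflow, assuming scikit-learn is available; the data is synthetic and the feature/target names are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic data: the label is a noisy threshold on the feature.
x = rng.uniform(0, 100, size=400)
y = (x + rng.normal(0, 10, size=400) > 50).astype(int)

# Bin the continuous feature into 4 quantile bins, one-hot encoded
# so logistic regression can use the categories.
binned = pd.get_dummies(pd.qcut(x, q=4))

X_train, X_val, y_train, y_val = train_test_split(
    binned, y, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"validation accuracy: {val_acc:.3f}")
```

The same split and metric can be reused on the unbinned feature to judge whether binning actually improved generalization.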
What error does this code raise?
import pandas as pd

values = [1, 2, 2, 2, 3]
bins = pd.qcut(values, q=4)
print(bins)
Check if the data has enough unique values for the requested bins.
qcut requires unique bin edges, but the repeated value 2 makes several quartile edges identical, so it raises a ValueError (unless duplicates='drop' is passed).
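A sketch of both behaviors: the default raises, while duplicates='drop' merges the repeated edges and returns fewer bins than requested:

```python
import pandas as pd

values = [1, 2, 2, 2, 3]

# Default: quartile edges are [1, 2, 2, 2, 3], which are not
# unique, so pandas raises a ValueError.
try:
    pd.qcut(values, q=4)
except ValueError as e:
    print("raised:", e)

# duplicates='drop' collapses the repeated edges to [1, 2, 3],
# yielding 2 bins instead of the requested 4.
bins = pd.qcut(values, q=4, duplicates="drop")
print(bins.categories)
```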