Data distributions and outliers in ML Python - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is a data distribution in machine learning?
A data distribution shows how data points are spread or arranged across different values. It helps us understand the common, rare, or unusual values in the data.
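As a rough illustration, a distribution can be summarised with a few numbers. The sketch below assumes NumPy is installed and uses a made-up `ages` sample; the values are hypothetical and chosen only to show common and rare values.

```python
import numpy as np

# Hypothetical sample: ages of ten customers (made-up values for illustration).
ages = np.array([22, 25, 25, 27, 29, 30, 31, 31, 31, 58])

# Centre and spread of the distribution.
mean = ages.mean()
median = np.median(ages)
std = ages.std(ddof=1)  # sample standard deviation

# How often each distinct value occurs: common values have high counts.
values, counts = np.unique(ages, return_counts=True)
most_common = values[counts.argmax()]

print(f"mean={mean:.1f}, median={median:.1f}, std={std:.1f}")
print("most common value:", most_common)
```

Here the mean (30.9) sits above the median (29.5) because the single large value, 58, pulls it upward.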
beginner
What is an outlier in a dataset?
An outlier is a data point that is very different from most other points. It can be much higher or lower than the rest and may affect how models learn.
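One common convention for deciding what counts as "very different" is the 1.5 × IQR rule (the same rule most box plots use). It is only one of several possible definitions. A minimal sketch, assuming NumPy and a made-up `data` array:

```python
import numpy as np

# Hypothetical sample (made-up values for illustration).
data = np.array([22, 25, 25, 27, 29, 30, 31, 31, 31, 58])

# Quartiles and interquartile range (IQR).
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outlier_mask = (data < lower) | (data > upper)
outliers = data[outlier_mask]

print("fences:", lower, upper)
print("outliers:", outliers)
```

With this sample the fences are 17.25 and 39.25, so only the value 58 is flagged.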
intermediate
Why is it important to detect outliers before training a model?
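Outliers can distort what a model learns: a few extreme values can pull fitted parameters (for example, the slope of a regression line) away from the pattern followed by most of the data. Detecting them before training lets you decide whether to keep, transform, or remove them, which often improves model accuracy and reliability.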
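To make the effect concrete, the sketch below fits a least-squares line to made-up data twice: once on clean points and once after corrupting a single point. It assumes NumPy; the data and the choice of `np.polyfit` as the model are illustrative assumptions, not part of the original card.

```python
import numpy as np

# Hypothetical clean data that follows y = 2x + 1 exactly.
x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0

# Same data, but the last point is replaced by an extreme value.
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0

# Least-squares straight-line fits; polyfit returns [slope, intercept].
slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, 1)

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with one outlier: {slope_outlier:.2f}")
```

A single corrupted point out of ten is enough to move the fitted slope well away from the true value of 2.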
beginner
Name two common ways to visualize data distributions.
Histograms and box plots are two common ways. A histogram shows how often values fall into each range (bin), and a box plot shows the median, the quartiles, and any points flagged as outliers.
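Both plots can be drawn side by side with Matplotlib. This is a minimal sketch that assumes Matplotlib and NumPy are installed; the data is randomly generated, with two extreme values added by hand so the box plot has something to flag.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 200 values around 50, plus two hand-added extremes.
rng = np.random.default_rng(seed=0)
values = np.concatenate([rng.normal(50, 5, size=200), [95.0, 102.0]])

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(9, 3.5))

# Histogram: frequency of values in each bin.
ax_hist.hist(values, bins=20)
ax_hist.set_title("Histogram")
ax_hist.set_xlabel("value")
ax_hist.set_ylabel("count")

# Box plot: median, quartiles, whiskers, and outliers drawn as separate points.
box = ax_box.boxplot(values)
ax_box.set_title("Box plot")
ax_box.set_ylabel("value")

fig.tight_layout()
```

Call `plt.show()` to view the figure interactively, or `fig.savefig(...)` to write it to a file. In the box plot, the two hand-added values appear as isolated points above the upper whisker.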
intermediate
How can you handle outliers in your data?
You can remove them, transform them (for example, with a log transform or by clipping extreme values), or use models that are less sensitive to outliers. The choice depends on the problem and the data.
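The first two options can be sketched in a few lines. The example assumes NumPy, a made-up `incomes` array, and the 1.5 × IQR fences as the cut-off; the right cut-off and strategy for real data depend on the problem.

```python
import numpy as np

# Hypothetical incomes in thousands (made-up values for illustration).
incomes = np.array([32.0, 35.0, 38.0, 40.0, 41.0, 43.0, 45.0, 47.0, 50.0, 400.0])

# 1.5 * IQR fences, used here as an assumed cut-off.
q1, q3 = np.percentile(incomes, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: remove points outside the fences.
removed = incomes[(incomes >= lower) & (incomes <= upper)]

# Option 2: clip extreme values to the fences, keeping every row.
clipped = np.clip(incomes, lower, upper)

# Option 3: log-transform to compress large values (log1p computes log(1 + x)).
logged = np.log1p(incomes)

print("removed:", removed)
print("clipped:", clipped)
print("logged: ", np.round(logged, 2))
```

Removing drops the row entirely, clipping keeps the row but caps its value at the upper fence (58.5 here), and the log transform keeps the ordering while shrinking the gap between 400 and the rest.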
What does a data distribution tell us?
A. The exact number of data points
B. The model's accuracy
C. Only the average value
D. How data points are spread across values
Which of these is an example of an outlier?
A. A data point close to the average
B. A data point far from most others
C. A missing data point
D. A data point repeated many times
Which visualization is best to spot outliers?
A. Box plot
B. Line chart
C. Histogram
D. Pie chart
Why might outliers be a problem for machine learning models?
A. They always improve model accuracy
B. They make data easier to understand
C. They can mislead the model to learn wrong patterns
D. They reduce the size of the dataset
What is one way to handle outliers?
A. Remove or transform them
B. Ignore them always
C. Add more outliers
D. Replace them with zeros only
Explain what a data distribution is and why it matters in machine learning.
Describe what outliers are and how they can affect machine learning models.