
Why data exploration matters in Pandas - Challenge Your Understanding

Challenge - 5 Problems
Predict Output
intermediate
What is the output of this data summary code?
Given the DataFrame below, what will be the output of the df.describe() method?
Pandas
import pandas as pd

data = {'age': [25, 30, 22, 40, 28], 'income': [50000, 60000, 45000, 80000, 52000]}
df = pd.DataFrame(data)
print(df.describe())
A
             age        income
count   5.000000      5.000000
mean   29.000000  57400.000000
std     6.855655  13740.451230
min    22.000000  45000.000000
25%    25.000000  50000.000000
50%    28.000000  52000.000000
75%    30.000000  60000.000000
max    40.000000  80000.000000
B
       age        income
count   5.0      5.000000
mean   29.0  57400.000000
std     7.5  13928.388277
min    22.0  45000.000000
25%    24.0  48000.000000
50%    28.0  52000.000000
75%    30.0  60000.000000
max    40.0  80000.000000
C
       age        income
count   5.0      5.000000
mean   29.0  57400.000000
std     7.5  13928.388277
min    22.0  45000.000000
25%    25.0  50000.000000
50%    28.0  52000.000000
75%    30.0  60000.000000
max    40.0  80000.000000
D
       age        income
count   5.0      5.000000
mean   29.0  57400.000000
std     7.5  13928.388277
min    22.0  45000.000000
25%    25.0  51000.000000
50%    28.0  52000.000000
75%    30.0  60000.000000
max    40.0  80000.000000
💡 Hint
Look carefully at the 25th percentile (the 25% row) for both columns.
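For reference, the std and 25% rows that describe() reports can be reproduced by hand. The sketch below uses a made-up `scores` column (not the quiz data) and assumes the pandas defaults: sample standard deviation (dividing by n - 1) and linearly interpolated quantiles.

```python
import pandas as pd

# Hypothetical data, chosen only to keep the arithmetic easy to follow
scores = pd.Series([10, 20, 30, 40], name="scores")

summary = scores.describe()

# describe() reports the sample standard deviation (divides by n - 1)
squared_dev = (scores - scores.mean()) ** 2
manual_std = (squared_dev.sum() / (len(scores) - 1)) ** 0.5

# The 25% row is the 0.25 quantile, interpolated between sorted values
manual_q25 = scores.quantile(0.25)

print(summary)
print(manual_std, manual_q25)
```

Comparing `summary["std"]` with `manual_std` is a quick way to confirm which standard deviation formula is in play before trusting a printed summary.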
Data Output
intermediate
How many missing values are in the DataFrame?
Given the DataFrame below, what is the output of df.isnull().sum()?
Pandas
import pandas as pd
import numpy as np

data = {'name': ['Alice', 'Bob', 'Charlie', 'David'], 'age': [25, np.nan, 30, 22], 'income': [50000, 60000, np.nan, 45000]}
df = pd.DataFrame(data)
print(df.isnull().sum())
A
name      0
age       1
income    0
dtype: int64
B
name      0
age       1
income    1
dtype: int64
C
name      0
age       2
income    1
dtype: int64
D
name      1
age       1
income    1
dtype: int64
💡 Hint
Check which columns have missing values and count them.
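The same counting pattern extends to a whole-frame total and a per-column share. This is a minimal sketch on a hypothetical DataFrame (the `city` and `temp` columns are invented for illustration and are not the quiz data).

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the quiz DataFrame
df = pd.DataFrame({
    "city": ["Oslo", None, "Lima"],
    "temp": [4.5, np.nan, 19.0],
})

per_column = df.isnull().sum()   # missing values in each column
total = int(per_column.sum())    # missing values in the whole frame
share = df.isnull().mean()       # fraction of rows missing, per column

print(per_column)
print(total)
print(share)
```

Note that isnull() treats both `None` and `np.nan` as missing, so object and numeric columns are counted the same way.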
Visualization
advanced
Which plot best shows the distribution of a numeric column?
You want to understand how the values in the 'age' column are spread out. Which plot below is best for this?
Pandas
import pandas as pd
import matplotlib.pyplot as plt

data = {'age': [22, 25, 25, 30, 35, 40, 40, 40, 45, 50]}
df = pd.DataFrame(data)

# Histogram: frequency of values within 5 equal-width bins
plt.hist(df['age'], bins=5)
plt.title('Histogram of Age')
plt.show()

# Scatter plot: each age value plotted against its row position
plt.scatter(range(len(df)), df['age'])
plt.title('Scatter plot of Age')
plt.show()

# Bar chart: count of each unique age value
age_counts = df['age'].value_counts()
plt.bar(age_counts.index, age_counts)
plt.title('Bar chart of Age counts')
plt.show()

# Boxplot: median, quartiles, and outliers
plt.boxplot(df['age'])
plt.title('Boxplot of Age')
plt.show()
A. Bar chart - shows counts of each unique age value
B. Scatter plot - shows individual age values by index
C. Boxplot - shows median, quartiles, and outliers of age
D. Histogram - shows frequency distribution of age values
💡 Hint
Think about which plot shows how often each age range occurs.
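To see what "how often each age range occurs" means in numbers, the ranges can be counted without drawing anything. The sketch below applies numpy's `np.histogram` to the same ages, assuming five equal-width bins across the observed range.

```python
import numpy as np

ages = [22, 25, 25, 30, 35, 40, 40, 40, 45, 50]

# Split the observed range into 5 equal-width bins and count values in each
counts, edges = np.histogram(ages, bins=5)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {n}")
```

Each printed line corresponds to one range of ages and the number of values that fall inside it.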
🧠 Conceptual
advanced
Why is data exploration important before modeling?
Which reason below best explains why exploring data is a crucial step before building a predictive model?
A. To identify patterns, detect errors, and understand data quality before modeling
B. To avoid visualizing data and rely only on automated tools
C. To skip cleaning and directly use raw data for faster results
D. To immediately train the model without checking data
💡 Hint
Think about what problems might happen if you don't understand your data first.
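As an illustration of the kinds of problems that can hide in raw data, here is a minimal first-pass check. The DataFrame is hypothetical, with three problems planted on purpose, and the `report` keys are names invented for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with three planted problems:
# a missing income, an impossible age, and a duplicated row
df = pd.DataFrame({
    "age": [25, 30, -5, 30],
    "income": [50000.0, 60000.0, np.nan, 60000.0],
})

report = {
    "rows": len(df),
    "missing": int(df.isnull().sum().sum()),
    "duplicates": int(df.duplicated().sum()),
    "negative_ages": int((df["age"] < 0).sum()),
}
print(report)
```

None of these checks is specific to a model; they only describe the data, which is why they are cheap to run first.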
🔧 Debug
expert
What error does this code raise during data exploration?
Consider this code snippet. What error will it raise when run?
Pandas
import pandas as pd

data = {'age': [25, 30, 22], 'income': [50000, 60000]}
df = pd.DataFrame(data)
print(df.describe())
A. ValueError: All arrays must be of the same length
B. KeyError: 'income'
C. TypeError: unsupported operand type(s) for +: 'int' and 'str'
D. No error, prints summary statistics
💡 Hint
Check if all columns have the same number of values.
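One way to run that check is to compare column lengths before building the DataFrame. The sketch below uses hypothetical `height` and `weight` columns and also shows one possible workaround, which is only appropriate when the short column really is missing values at the end.

```python
import pandas as pd

# Hypothetical columns of unequal length
data = {"height": [170, 165, 180], "weight": [70, 60]}

# Check the lengths before building the frame
lengths = {name: len(values) for name, values in data.items()}
print(lengths)

# One possible workaround: wrap each list in a Series, so pandas aligns on
# the index and fills the gap with NaN
df = pd.DataFrame({name: pd.Series(values) for name, values in data.items()})
print(df)
```

If the mismatch comes from a data-entry mistake rather than genuinely missing values, fixing the source lists is the safer choice than padding.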