Challenge - 5 Problems

Data Exploration Master

❓ Predict Output

intermediate

What is the output of this data summary code?

Given the DataFrame below, what will be the output of the df.describe() method?

Pandas

import pandas as pd

data = {'age': [25, 30, 22, 40, 28], 'income': [50000, 60000, 45000, 80000, 52000]}
df = pd.DataFrame(data)
print(df.describe())

       age        income
count   5.0      5.000000
mean   29.0  57400.000000
std     7.0  13928.388277
min    22.0  45000.000000
25%    25.0  50000.000000
50%    28.0  52000.000000
75%    30.0  60000.000000
max    40.0  80000.000000

       age        income
count   5.0      5.000000
mean   29.0  57400.000000
std     7.5  13928.388277
min    22.0  45000.000000
25%    24.0  48000.000000
50%    28.0  52000.000000
75%    30.0  60000.000000
max    40.0  80000.000000

             age        income
count   5.000000      5.000000
mean   29.000000  57400.000000
std     6.855655  13740.451230
min    22.000000  45000.000000
25%    25.000000  50000.000000
50%    28.000000  52000.000000
75%    30.000000  60000.000000
max    40.000000  80000.000000

       age        income
count   5.0      5.000000
mean   29.0  57400.000000
std     7.5  13928.388277
min    22.0  45000.000000
25%    25.0  51000.000000
50%    28.0  52000.000000
75%    30.0  60000.000000
max    40.0  80000.000000
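
To check any of the candidate summaries, the statistics can be recomputed by hand. The sketch below assumes pandas' default behaviour: describe() reports the sample standard deviation (ddof=1) and linearly interpolated quartiles. The variable names are illustrative only.

```python
import math
import pandas as pd

ages = pd.Series([25, 30, 22, 40, 28], name="age")

# Sample standard deviation: squared deviations divided by n - 1, not n.
mean_age = ages.sum() / len(ages)
sample_var = sum((x - mean_age) ** 2 for x in ages) / (len(ages) - 1)
manual_std = math.sqrt(sample_var)

# describe() reports the same quantity as Series.std(), whose default is ddof=1.
summary = ages.describe()

print(round(manual_std, 6), round(summary["std"], 6))
print(summary["25%"], summary["50%"], summary["75%"])
```

With five sorted values, the 25%, 50%, and 75% positions fall exactly on the second, third, and fourth values, so no interpolation between neighbours is needed for this data.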

❓ Data Output

intermediate

How many missing values are in the DataFrame?

Given the DataFrame below, what is the output of df.isnull().sum()?

Pandas

import pandas as pd
import numpy as np

data = {'name': ['Alice', 'Bob', 'Charlie', 'David'], 'age': [25, np.nan, 30, 22], 'income': [50000, 60000, np.nan, 45000]}
df = pd.DataFrame(data)
print(df.isnull().sum())

name      0
age       1
income    0
dtype: int64

name      0
age       1
income    1
dtype: int64

name      0
age       2
income    1
dtype: int64

name      1
age       1
income    1
dtype: int64
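
As a rough illustration of what df.isnull().sum() is doing, the sketch below splits the call into its two steps and also counts per row and overall. It reuses the DataFrame from the question. The names mask, per_column, per_row, and total are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, np.nan, 30, 22],
    "income": [50000, 60000, np.nan, 45000],
})

mask = df.isnull()              # boolean frame: True wherever a cell is NaN/None
per_column = mask.sum()         # column-wise count (axis=0 is the default)
per_row = mask.sum(axis=1)      # row-wise count
total = int(mask.sum().sum())   # count over the whole frame

print(per_column)
print(per_row.tolist(), total)
```

Because True counts as 1 when summed, each column's total is simply the number of missing cells in that column.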

❓ Visualization

advanced

Which plot best shows the distribution of a numeric column?

You want to understand how the values in the 'age' column are spread out. Which plot below is best for this?

Pandas

import pandas as pd
import matplotlib.pyplot as plt

data = {'age': [22, 25, 25, 30, 35, 40, 40, 40, 45, 50]}
df = pd.DataFrame(data)

# Histogram
plt.hist(df['age'], bins=5)
plt.title('Histogram of Age')
plt.show()

# Scatter plot
plt.scatter(range(len(df)), df['age'])
plt.title('Scatter plot of Age')
plt.show()

# Bar chart
age_counts = df['age'].value_counts()
plt.bar(age_counts.index, age_counts)
plt.title('Bar chart of Age counts')
plt.show()

# Boxplot
plt.boxplot(df['age'])
plt.title('Boxplot of Age')
plt.show()

A. Bar chart - shows counts of each unique age value

B. Scatter plot - shows individual age values by index

C. Boxplot - shows median, quartiles, and outliers of age

D. Histogram - shows frequency distribution of age values
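
To see what a five-bin histogram of this column encodes without opening a plot window, the bin counts can be computed directly. This is a sketch that assumes plt.hist delegates its binning to numpy.histogram, as the Matplotlib documentation describes. The text rendering is purely illustrative.

```python
import numpy as np

ages = [22, 25, 25, 30, 35, 40, 40, 40, 45, 50]

# Five equal-width bins spanning the minimum to the maximum of the data.
counts, edges = np.histogram(ages, bins=5)

# Print one row per bin, with one '#' per observation in that bin.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * int(n)}")
```

Each row corresponds to one bar of the histogram, so the printout shows where values cluster and where they are sparse.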

🧠 Conceptual

advanced

Why is data exploration important before modeling?

Which reason below best explains why exploring data is a crucial step before building a predictive model?

A. To identify patterns, detect errors, and understand data quality before modeling

B. To avoid visualizing data and rely only on automated tools

C. To skip cleaning and directly use raw data for faster results

D. To immediately train the model without checking data
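
For a concrete sense of what checking data quality can look like in practice, here is a minimal sketch of a pre-modeling check. The helper quick_quality_report and the sample values are hypothetical, made up for illustration; real projects usually need checks specific to their data.

```python
import numpy as np
import pandas as pd

def quick_quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per column with basic quality indicators."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "unique": df.nunique(),   # NaN is not counted as a distinct value
    })

# Illustrative data with one missing value per column and one repeated row.
df = pd.DataFrame({
    "age": [25, np.nan, 30, 22, 22],
    "income": [50000, 60000, np.nan, 45000, 45000],
})

report = quick_quality_report(df)
duplicate_rows = int(df.duplicated().sum())

print(report)
print("duplicate rows:", duplicate_rows)
```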

🔧 Debug

expert

What error does this code raise during data exploration?

Consider this code snippet. What error will it raise when run?

Pandas

import pandas as pd

data = {'age': [25, 30, 22], 'income': [50000, 60000]}
df = pd.DataFrame(data)
print(df.describe())

A. ValueError: All arrays must be of the same length

B. KeyError: 'income'

C. TypeError: unsupported operand type(s) for +: 'int' and 'str'

D. No error, prints summary statistics
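
One way to investigate a snippet like this is to wrap the constructor in try/except and inspect what is raised. The sketch below also shows one possible workaround: wrapping each list in a pd.Series so that pandas aligns on the index and fills the gap with NaN. This assumes the missing value belongs at the end of the shorter column, which may not hold for real data.

```python
import pandas as pd

data = {"age": [25, 30, 22], "income": [50000, 60000]}

# Columns built from plain lists must all have the same length.
try:
    df = pd.DataFrame(data)
    error_name = None
except ValueError as exc:
    error_name = type(exc).__name__
    print(f"{error_name}: {exc}")

# Series are aligned on their index, so the shorter column is padded with NaN.
padded = pd.DataFrame({col: pd.Series(vals) for col, vals in data.items()})
print(padded)
```

The exact wording of the error message has varied between pandas versions, so it is safer to rely on the exception type than on the message text.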
