
Strategies for Working with Large Datasets in Pandas - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
What is the output of this code when reading a large CSV in chunks?

Consider this code that reads a large CSV file in chunks and counts total rows:

import pandas as pd
chunk_iter = pd.read_csv('large_data.csv', chunksize=1000)
total_rows = 0
for chunk in chunk_iter:
    total_rows += len(chunk)
print(total_rows)

If the CSV has 4523 rows, what will be printed?

A) 4000
B) 4523
C) 5000
D) 1000
💡 Hint

Think about how chunksize controls the number of rows per chunk and how the loop sums all rows.
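To see the chunk arithmetic concretely, here is a runnable sketch that simulates the same pattern with an in-memory CSV (the 4523-row data is made up for illustration) instead of a `large_data.csv` file:

```python
import io
import pandas as pd

# Simulate a CSV with 4523 data rows (hypothetical stand-in for large_data.csv).
csv_text = "value\n" + "\n".join(str(i) for i in range(4523))

# Read it in chunks of 1000 rows, tracking each chunk's length.
chunk_iter = pd.read_csv(io.StringIO(csv_text), chunksize=1000)
chunk_sizes = []
total_rows = 0
for chunk in chunk_iter:
    chunk_sizes.append(len(chunk))
    total_rows += len(chunk)

print(chunk_sizes)  # four full chunks of 1000, then a final partial chunk of 523
print(total_rows)   # 4523
```

The last chunk is simply smaller than `chunksize`, so summing `len(chunk)` over all chunks recovers the exact row count.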

Data Output (intermediate)
What is the memory usage after downcasting numeric columns?

Given this DataFrame with integer columns:

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': np.random.randint(0, 1000, size=10000),
    'B': np.random.randint(0, 100000, size=10000)
})
df_optimized = df.copy()
df_optimized['A'] = pd.to_numeric(df_optimized['A'], downcast='unsigned')
df_optimized['B'] = pd.to_numeric(df_optimized['B'], downcast='unsigned')
print(df_optimized.memory_usage(deep=True).sum())

Which option best describes the output?

A) Exactly the same as original memory usage
B) More than original memory usage due to copying
C) Less than original memory usage because of downcasting
D) Raises a TypeError because downcast='unsigned' is invalid
💡 Hint

Downcasting reduces the size of numeric columns by using smaller data types.
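A runnable sketch of the effect: comparing total memory before and after downcasting. The exact dtypes chosen (e.g. `uint16` vs `uint32`) depend on the random values and platform defaults, so only the before/after comparison is asserted here, not exact byte counts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': np.random.randint(0, 1000, size=10000),    # values fit in a small unsigned type
    'B': np.random.randint(0, 100000, size=10000),  # values fit in uint32
})
before = df.memory_usage(deep=True).sum()

df_opt = df.copy()
df_opt['A'] = pd.to_numeric(df_opt['A'], downcast='unsigned')
df_opt['B'] = pd.to_numeric(df_opt['B'], downcast='unsigned')
after = df_opt.memory_usage(deep=True).sum()

print(df.dtypes.tolist(), '->', df_opt.dtypes.tolist())
print(before, '->', after)  # after < before: smaller dtypes, less memory
```

`downcast='unsigned'` asks `pd.to_numeric` for the smallest unsigned integer dtype that can hold every value in the column, which is why the optimized total is lower.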

🔧 Debug (advanced)
Does this code raise an error when filtering a large DataFrame?

Look at this code snippet:

import pandas as pd
df = pd.DataFrame({'value': range(1000000)})
filtered = df[df['value'] > 500000]
filtered.reset_index(inplace=True)
print(filtered.head())

Will this code produce an error?

A) No error, prints first 5 rows with old index as a column
B) KeyError because 'index' column does not exist
C) TypeError due to inplace=True with reset_index
D) ValueError because filtered DataFrame is empty
💡 Hint

reset_index with inplace=True adds the old index as a column named 'index'.
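A small runnable sketch of the same behavior on a 10-row frame (the assignment form `filtered = filtered.reset_index()` is used instead of `inplace=True`, since calling an inplace method on a filtered slice can trigger a `SettingWithCopyWarning` in some pandas versions):

```python
import pandas as pd

df = pd.DataFrame({'value': range(10)})
filtered = df[df['value'] > 5]       # boolean filter keeps original labels 6..9
filtered = filtered.reset_index()    # old labels become a column named 'index'

print(filtered.columns.tolist())     # ['index', 'value']
print(filtered['index'].tolist())    # [6, 7, 8, 9]
```

No error is raised: `reset_index` moves the old labels into an `'index'` column and installs a fresh 0-based RangeIndex. Pass `drop=True` if the old labels are not needed.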

Visualization (advanced)
Which plot best visualizes memory usage reduction after optimization?

You have original and optimized DataFrames with memory usage values:

memory_original = 8000000
memory_optimized = 2000000

Which plot code will clearly show the reduction?

import matplotlib.pyplot as plt
memory_original = 8000000
memory_optimized = 2000000
plt.bar(['Original', 'Optimized'], [memory_original, memory_optimized])
plt.ylabel('Memory Usage (bytes)')
plt.title('Memory Usage Before and After Optimization')
plt.show()
A) Scatter plot of memory usage vs. row count
B) Line plot with memory usage over time
C) Pie chart showing percentage of memory used by each column
D) Bar chart comparing original and optimized memory usage
💡 Hint

Bar charts are good for comparing two values side by side.

🧠 Conceptual (expert)
Which strategy is best for handling datasets larger than RAM?

You have a dataset too large to fit into your computer's RAM. Which approach is best to analyze it efficiently?

A) Use chunked reading with pandas.read_csv and process each chunk separately
B) Load entire dataset into memory using pandas.read_csv without chunks
C) Convert dataset to Excel and open in spreadsheet software
D) Increase RAM physically and then load dataset fully
💡 Hint

Think about how to work with data that doesn't fit in memory.
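The chunked approach works because many aggregations can be computed incrementally, so only one chunk ever needs to be in memory. Here is a runnable sketch computing a global mean this way; the in-memory CSV is a hypothetical stand-in for a file larger than RAM:

```python
import io
import pandas as pd

# Hypothetical stand-in for a file too large to load at once.
csv_text = "value\n" + "\n".join(str(i) for i in range(10000))

# Accumulate partial sums and counts chunk by chunk.
total_sum = 0
total_count = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2500):
    total_sum += chunk['value'].sum()
    total_count += len(chunk)

mean = total_sum / total_count
print(mean)  # 4999.5, identical to the full-dataset mean
```

The same pattern extends to per-group sums, counts, min/max, and histograms; statistics that need the whole dataset at once (e.g. an exact median) require more careful multi-pass or approximate techniques.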