Challenge - 5 Problems
Large File Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate
Reading a large binary file with memory mapping
What is the shape of the numpy array data after running this code snippet? (NumPy)

```python
import numpy as np

filename = 'large_file.dat'
data = np.memmap(filename, dtype='float32', mode='r', shape=(1000, 1000))
print(data.shape)
```
💡 Hint
The shape parameter defines the array dimensions when using np.memmap.
✅ Explanation
np.memmap creates an array view of the file with the given shape. Here, shape is (1000, 1000), so the output shape is (1000, 1000).
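The explanation above can be verified at small scale. This is a minimal sketch assuming NumPy is installed; the file name and the (3, 4) shape are stand-ins for the quiz's 'large_file.dat' and (1000, 1000):

```python
import numpy as np
import tempfile, os

# Write a small binary file to stand in for 'large_file.dat' (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "large_file.dat")
np.arange(12, dtype="float32").tofile(path)

# Memory-map it read-only; the shape parameter reshapes the flat file on disk.
data = np.memmap(path, dtype="float32", mode="r", shape=(3, 4))
print(data.shape)         # (3, 4)
print(float(data[1, 2]))  # 6.0 -- elements are read from disk on demand
```

The array reports its shape immediately because memmap only maps the file; no element is read into RAM until it is indexed.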
❓ Multiple Choice
Intermediate
Effect of chunk reading on memory usage
Given a large CSV file too big to fit in memory, which code snippet correctly reads it in chunks and prints the total number of rows? (pandas)

```python
import pandas as pd

filename = 'large_data.csv'
chunk_size = 10000
row_count = 0
for chunk in pd.read_csv(filename, chunksize=chunk_size):
    row_count += len(chunk)
print(row_count)
```
💡 Hint
Reading in chunks allows processing parts of the file without loading all data at once.
✅ Explanation
The code sums the length of each chunk, resulting in the total number of rows in the file.
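The chunked row count can be demonstrated with a small in-memory CSV. This sketch assumes pandas is installed; the data and chunk size are stand-ins for the quiz's 'large_data.csv' and 10000:

```python
import pandas as pd
import io

# Small in-memory CSV standing in for 'large_data.csv' (hypothetical contents).
csv_text = "value\n" + "\n".join(str(i) for i in range(25))

row_count = 0
# chunksize=10 yields DataFrames of at most 10 rows each,
# so only one chunk is in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    row_count += len(chunk)

print(row_count)  # 25
```

The three chunks hold 10, 10, and 5 rows; summing their lengths recovers the full row count without ever materializing the whole file.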
🔧 Debug
Advanced
Fixing memory error when loading large numpy array
This code tries to load a large numpy array from a file but causes a MemoryError. Which option fixes the issue by using memory mapping? (NumPy)

```python
import numpy as np

data = np.load('large_array.npy')
print(data.sum())
```
💡 Hint
Use the mmap_mode parameter in np.load to avoid loading all data into memory.
✅ Explanation
Option A uses mmap_mode='r', which memory-maps the file, preventing the MemoryError by loading data on demand.
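The memory-mapped fix the hint describes can be sketched with a small stand-in array (the quiz's answer options are not reproduced here; the file name and sizes are hypothetical):

```python
import numpy as np
import tempfile, os

# Save a small array to stand in for 'large_array.npy'.
path = os.path.join(tempfile.mkdtemp(), "large_array.npy")
np.save(path, np.ones((1000, 100), dtype="float64"))

# mmap_mode='r' maps the file instead of reading it all into RAM;
# np.load returns a read-only np.memmap, and pages are faulted in
# lazily as the reduction touches them.
data = np.load(path, mmap_mode="r")
print(data.sum())  # 100000.0
```

The reduction still visits every element, but peak memory stays bounded by the OS page cache rather than the array size.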
❓ Multiple Choice
Advanced
Visualizing data read in chunks
You want to plot the sum of values in each chunk of a large CSV file. Which code snippet produces a line plot of chunk sums? (pandas)

```python
import pandas as pd
import matplotlib.pyplot as plt

filename = 'large_data.csv'
chunk_size = 5000
chunk_sums = []
for chunk in pd.read_csv(filename, chunksize=chunk_size):
    chunk_sums.append(chunk['value'].sum())

plt.plot(chunk_sums)
plt.xlabel('Chunk number')
plt.ylabel('Sum of values')
plt.title('Sum per chunk')
plt.show()
```
💡 Hint
Summing values per chunk and plotting them shows trends without loading full data.
✅ Explanation
The code sums the 'value' column in each chunk and plots these sums as a line graph.
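The per-chunk sums that feed the plot can be computed at small scale. This sketch assumes pandas and a numeric 'value' column; the plotting calls are left out so it focuses on the data the line plot would show:

```python
import pandas as pd
import io

# In-memory CSV with a 'value' column, standing in for 'large_data.csv'.
csv_text = "value\n" + "\n".join(str(i) for i in range(12))

chunk_sums = []
# Each chunk contributes one point to the eventual line plot.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    chunk_sums.append(int(chunk["value"].sum()))

print(chunk_sums)  # [10, 35, 21]
```

Passing chunk_sums to plt.plot, as in the snippet above, then draws one point per chunk against its index.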
🧠 Conceptual
Expert
Choosing the best method for large file processing
You have a 100GB CSV file and limited RAM (8GB). You want to compute the average of a numeric column efficiently. Which approach is best?
💡 Hint
Think about memory limits and efficient partial processing.
✅ Explanation
Reading in chunks avoids memory overload and allows incremental computation of the average.
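The incremental average the explanation describes can be sketched as a running sum and count over chunks. This assumes pandas and a numeric 'value' column; the column name and chunk size are stand-ins:

```python
import pandas as pd
import io

# Small stand-in for the 100GB CSV; the technique is identical at any size.
csv_text = "value\n" + "\n".join(str(i) for i in range(100))

total, count = 0.0, 0
# Accumulate sum and count per chunk; memory use is bounded by chunksize,
# never by the total file size.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=30):
    total += chunk["value"].sum()
    count += len(chunk)

print(total / count)  # 49.5
```

Because the mean decomposes into a global sum divided by a global count, each chunk can be discarded after its contribution is accumulated, which is what makes this approach fit in 8GB of RAM.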