In Apache Spark, you have two datasets: one stored as CSV files and another as Parquet files. Both contain the same data. Why does reading Parquet files usually result in faster query performance?
Think about how data is stored and how Spark reads only parts of the data it needs.
Parquet is a columnar storage format that compresses data and stores metadata, enabling Spark to skip reading unnecessary columns. CSV is row-based and uncompressed, so Spark reads all data even if only some columns are needed.
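The layout difference can be sketched in plain Python (this is an illustration of row-wise vs. column-wise storage, not Spark itself; all names and values are hypothetical):

```python
# Row-oriented layout (like CSV): each record is stored together, so
# fetching one field still means scanning through whole records on disk.
rows = [
    {"id": 1, "name": "a", "age": 34},
    {"id": 2, "name": "b", "age": 28},
    {"id": 3, "name": "c", "age": 41},
]
ages_from_rows = [row["age"] for row in rows]

# Column-oriented layout (like Parquet): each column is stored
# contiguously, so a reader can load only the 'age' values and skip
# the other columns entirely.
columns = {
    "id":   [1, 2, 3],
    "name": ["a", "b", "c"],
    "age":  [34, 28, 41],
}
ages_from_columns = columns["age"]

print(ages_from_rows)     # [34, 28, 41]
print(ages_from_columns)  # [34, 28, 41]
```

Both layouts yield the same answer; the columnar one simply lets a query touching one column read far less data, which is the pruning Spark exploits with Parquet.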
You save the same DataFrame in Spark as CSV and Parquet formats. Which option correctly shows the typical size difference between the two files?
df.write.csv('data_csv')
df.write.parquet('data_parquet')
# After saving, check the sizes of the 'data_csv' and 'data_parquet' folders
Consider how compression affects file size.
Parquet files use compression and efficient encoding, so they usually take less disk space than CSV files, which are plain text and uncompressed.
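The effect of compressing repetitive tabular text can be shown with the standard library alone (a sketch: Parquet actually uses dictionary and run-length encodings plus a codec such as Snappy, but gzip on CSV text illustrates why the size gap is so large; the data is made up):

```python
import csv
import gzip
import io

# Build a small, highly repetitive CSV in memory (hypothetical data).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "country", "status"])
for i in range(10_000):
    writer.writerow([i, "Germany", "active"])
raw = buf.getvalue().encode("utf-8")

# Compare plain-text size to the compressed size. Columns full of
# repeated values compress dramatically, which is the same property
# Parquet's encodings exploit.
compressed = gzip.compress(raw)
print(len(raw), len(compressed))
```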
Consider this Spark code snippet:
df_json = spark.read.json('data.json')
df_parquet = spark.read.parquet('data.parquet')
print(df_json.count())
print(df_parquet.count())

Assuming both files contain the same data, what will be the output?
Think about how Spark reads data formats and counts rows.
Both JSON and Parquet readers load the data and count rows. If the data is the same and valid, counts will match. Performance differs but output count is the same.
You run the same query on datasets stored as CSV, JSON, and Parquet. You want to visualize query execution time for each format. Which plot type best shows this comparison clearly?
Think about comparing categories with numeric values.
A bar chart clearly compares execution times across different data formats side by side. Pie charts and scatter plots are less clear for this purpose.
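A minimal matplotlib sketch of such a comparison (the timings below are invented for illustration; substitute your measured values):

```python
import os

import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

formats = ["CSV", "JSON", "Parquet"]
exec_time_s = [12.4, 15.1, 3.2]  # hypothetical query times in seconds

# One bar per format puts the three numbers side by side on a shared
# axis, which is exactly the comparison the question asks for.
plt.bar(formats, exec_time_s)
plt.ylabel("Query execution time (s)")
plt.title("Same query, three storage formats")
plt.savefig("format_timing.png")

saved = os.path.exists("format_timing.png")
print(saved)
```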
You have this Spark code:
df = spark.read.csv('data.csv', header=True)
df_filtered = df.filter(df['age'] > 30)
df_filtered.show()

The same filter on a Parquet file runs much faster. What is the main reason for this difference?
Consider how Spark optimizes queries differently for formats.
Parquet stores metadata and per-column statistics (such as min/max values) that enable predicate pushdown, so Spark reads only the row groups that can match the filter. CSV has no such metadata, so Spark must scan the entire file, making it slower.
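The mechanism behind that answer can be sketched in plain Python (this mimics Parquet's per-row-group min/max statistics; the structure and data are hypothetical, not Spark APIs):

```python
# Each "row group" carries min/max statistics for the 'age' column,
# the way Parquet files do for each column chunk.
row_groups = [
    {"min_age": 18, "max_age": 25, "ages": [18, 21, 25]},
    {"min_age": 26, "max_age": 29, "ages": [26, 28, 29]},
    {"min_age": 31, "max_age": 60, "ages": [31, 45, 60]},
]

def ages_over(groups, threshold):
    """Return matching ages and how many groups were actually read."""
    matches, groups_read = [], 0
    for rg in groups:
        if rg["max_age"] <= threshold:  # statistics prove no row matches,
            continue                    # so skip the group without reading it
        groups_read += 1
        matches.extend(a for a in rg["ages"] if a > threshold)
    return matches, groups_read

matches, groups_read = ages_over(row_groups, 30)
print(matches, groups_read)  # [31, 45, 60] 1
```

Only one of the three groups is read. In real Spark you can observe this: running `df_filtered.explain()` on a Parquet-backed DataFrame shows the filter listed under `PushedFilters` in the file scan node, while a CSV scan must evaluate the filter after reading every row.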