Challenge - 5 Problems
Parquet Mastery Badge
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of reading Parquet file with Spark
What will be the output of the following Spark code snippet when reading a Parquet file and showing the first 3 rows?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.read.parquet('data/sample.parquet')
df.show(3)
Attempts: 2 left
💡 Hint
The show() method displays rows of the DataFrame in a table format.
✗ Incorrect
The show(3) method prints the first 3 rows of the DataFrame read from the Parquet file. If the file exists and is readable, it outputs the data in tabular form.
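To make the behavior concrete without a Spark installation, here is a pure-Python sketch of what show(3) does conceptually: take the first 3 rows and print them as a small table. The rows below are hypothetical stand-ins for the Parquet file's contents.

```python
# Hypothetical rows standing in for 'data/sample.parquet'.
rows = [("Alice", 30), ("Bob", 25), ("Cara", 41), ("Dan", 28)]

def show(rows, n):
    # Mimic DataFrame.show(n): print only the first n rows as a table.
    print("+-----+---+")
    for name, age in rows[:n]:
        print(f"|{name:<5}|{age:>3}|")
    print("+-----+---+")

show(rows, 3)  # prints Alice, Bob, and Cara only
```

The actual Spark output also includes a header row with column names, but the key point is the same: only the first 3 rows are rendered.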
🧠 Conceptual
Intermediate · 1:30 remaining
Why is Parquet format efficient for big data?
Which of the following best explains why Parquet format is efficient for big data processing?
Attempts: 2 left
💡 Hint
Think about how columnar storage helps with reading only needed data.
✗ Incorrect
Parquet stores data in a columnar format which allows reading only the columns needed for a query. It also compresses data efficiently, reducing storage and speeding up processing.
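A toy illustration of the idea (this is not Parquet's actual on-disk layout): the same table stored row-wise versus column-wise. A query that needs only one column can scan a single list in the columnar layout instead of touching every field of every row.

```python
# Toy sketch: row-wise vs column-wise storage of the same small table.
row_store = [
    {"name": "Alice", "age": 30, "city": "Oslo"},
    {"name": "Bob",   "age": 25, "city": "Lima"},
]
column_store = {
    "name": ["Alice", "Bob"],
    "age":  [30, 25],
    "city": ["Oslo", "Lima"],
}

# Row store: every row object must be visited to extract one column.
ages_from_rows = [row["age"] for row in row_store]

# Column store: the 'age' column is read directly; 'name' and 'city'
# are never touched at all.
ages_from_columns = column_store["age"]

assert ages_from_rows == ages_from_columns == [30, 25]
```

On disk, Parquet adds compression and per-column statistics on top of this layout, which is why it can also skip whole chunks of data that cannot match a query.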
❓ Predict Output
Advanced · 2:00 remaining
Result of filtering Parquet data by column
Given a Parquet file with columns 'name', 'age', and 'city', what will be the output count after running this Spark code?
Apache Spark
df = spark.read.parquet('data/people.parquet')
filtered_df = df.filter(df.age > 30)
count = filtered_df.count()
print(count)
Attempts: 2 left
💡 Hint
Filter keeps only rows matching the condition.
✗ Incorrect
The filter(df.age > 30) keeps rows where age is greater than 30. The count() returns how many such rows exist.
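The same filter-then-count logic can be sketched in pure Python. The rows below are hypothetical stand-ins for 'data/people.parquet'; with real data, the printed number depends entirely on the file's contents.

```python
# Hypothetical rows standing in for 'data/people.parquet'.
people = [
    {"name": "Alice", "age": 34, "city": "Oslo"},
    {"name": "Bob",   "age": 28, "city": "Lima"},
    {"name": "Cara",  "age": 41, "city": "Pune"},
]

# Equivalent of df.filter(df.age > 30).count(): keep matching rows,
# then count them.
count = sum(1 for p in people if p["age"] > 30)
print(count)  # 2 for these sample rows
```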
🔧 Debug
Advanced · 2:00 remaining
Identify the error in Parquet write code
What error will this Spark code produce when trying to write a DataFrame to Parquet?
Apache Spark
df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], ['id', 'name'])
df.write.parquet('output/data.parquet', mode='overwrite', format='csv')
Attempts: 2 left
💡 Hint
Check the method signature for write.parquet().
✗ Incorrect
The write.parquet() method has no 'format' parameter (its keyword arguments are mode, partitionBy, and compression), so passing format='csv' raises a TypeError for an unexpected keyword argument. To write CSV, use df.write.format('csv').save(path) instead.
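A corrected sketch of the two intended writes, wrapped in a function so the snippet imports cleanly even where PySpark is not installed; df is assumed to be a Spark DataFrame when the function is actually called.

```python
def write_both(df):
    # Parquet: write.parquet() takes only path, mode, partitionBy,
    # and compression -- there is no 'format' keyword.
    df.write.parquet("output/data.parquet", mode="overwrite")

    # CSV: choose the format through the generic writer instead.
    df.write.format("csv").mode("overwrite").save("output/data.csv")
```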
🚀 Application
Expert · 2:30 remaining
Optimizing query performance with Parquet and column pruning
You have a large Parquet dataset with many columns. You want to speed up a Spark query that only needs 'user_id' and 'purchase_amount'. Which approach will best improve performance?
Attempts: 2 left
💡 Hint
Think about how columnar storage helps avoid reading unnecessary data.
✗ Incorrect
Parquet's columnar format allows Spark to read only the requested columns, reducing I/O and speeding up queries. Selecting columns after reading all data wastes resources.
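A sketch of column pruning at read time, wrapped in a function so it runs without a live SparkSession; the file path is a hypothetical placeholder, while the column names come from the question above.

```python
def read_needed_columns(spark):
    # Selecting columns directly after the Parquet read lets Spark push
    # the projection down, so only the 'user_id' and 'purchase_amount'
    # column chunks are scanned instead of the whole file.
    return (spark.read.parquet("data/purchases.parquet")
                 .select("user_id", "purchase_amount"))
```

Reading the full dataset and selecting columns afterwards forces Spark to plan around all columns; stating the projection up front keeps the I/O proportional to the two columns actually needed.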