
Parquet format and columnar storage in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output of reading Parquet file with Spark
What will be the output of the following Spark code snippet when reading a Parquet file and showing the first 3 rows?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.read.parquet('data/sample.parquet')
df.show(3)
A. Throws a SyntaxError due to incorrect method usage
B. Displays the first 3 rows of the DataFrame with all columns in tabular format
C. Raises FileNotFoundError because 'data/sample.parquet' does not exist
D. Prints the schema of the DataFrame instead of data rows
💡 Hint
The show() method displays rows of the DataFrame in a table format.
🧠 Conceptual (intermediate)
Why is Parquet format efficient for big data?
Which of the following best explains why Parquet format is efficient for big data processing?
A. Parquet compresses data and stores it in a columnar format, allowing faster reads of specific columns
B. Parquet stores data in a row-based format, making it faster for all queries
C. Parquet requires no schema, which speeds up data loading
D. Parquet stores data as plain text, making it easy to read
💡 Hint
Think about how columnar storage helps with reading only needed data.
Data Output (advanced)
Result of filtering Parquet data by column
Given a Parquet file with columns 'name', 'age', and 'city', what will be the output count after running this Spark code?
Apache Spark
df = spark.read.parquet('data/people.parquet')
filtered_df = df.filter(df.age > 30)
count = filtered_df.count()
print(count)
A. Total number of rows in the DataFrame regardless of age
B. Raises AnalysisException due to missing column 'age'
C. Zero, because the filter syntax is incorrect
D. Number of rows where age is greater than 30
💡 Hint
Filter keeps only rows matching the condition.
🔧 Debug (advanced)
Identify the error in Parquet write code
What error will this Spark code produce when trying to write a DataFrame to Parquet?
Apache Spark
df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], ['id', 'name'])
df.write.parquet('output/data.parquet', mode='overwrite', format='csv')
A. TypeError because 'format' is not a valid argument for write.parquet()
B. Writes the DataFrame successfully in CSV format
C. AnalysisException due to missing output directory
D. SyntaxError due to incorrect method chaining
💡 Hint
Check the method signature for write.parquet().
🚀 Application (expert)
Optimizing query performance with Parquet and column pruning
You have a large Parquet dataset with many columns. You want to speed up a Spark query that only needs 'user_id' and 'purchase_amount'. Which approach will best improve performance?
A. Read the entire Parquet file and then select the 'user_id' and 'purchase_amount' columns
B. Convert the Parquet file to CSV first, then read only the needed columns
C. Use Spark to read only the 'user_id' and 'purchase_amount' columns from the Parquet file directly
D. Load the Parquet data fully into memory before filtering columns
💡 Hint
Think about how columnar storage helps avoid reading unnecessary data.