
Parquet format and columnar storage in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output of reading Parquet file with Spark
What will be the output of the following Spark code snippet when reading a Parquet file and showing the first 3 rows?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.read.parquet('data/sample.parquet')
df.show(3)
A. Throws a SyntaxError due to incorrect method usage
B. Displays the first 3 rows of the DataFrame with all columns in tabular format
C. Raises FileNotFoundError because 'data/sample.parquet' does not exist
D. Prints the schema of the DataFrame instead of data rows
💡 Hint
The show() method displays rows of the DataFrame in a table format.
🧠 Conceptual (intermediate)
Why is Parquet format efficient for big data?
Which of the following best explains why Parquet format is efficient for big data processing?
A. Parquet compresses data and stores it in a columnar format, allowing faster reads of specific columns
B. Parquet stores data in a row-based format, making it faster for all queries
C. Parquet requires no schema, which speeds up data loading
D. Parquet stores data as plain text, making it easy to read
💡 Hint
Think about how columnar storage helps with reading only needed data.
Data Output (advanced)
Result of filtering Parquet data by column
Given a Parquet file with columns 'name', 'age', and 'city', what will be the output count after running this Spark code?
Apache Spark
df = spark.read.parquet('data/people.parquet')
filtered_df = df.filter(df.age > 30)
count = filtered_df.count()
print(count)
A. Total number of rows in the DataFrame regardless of age
B. Raises AnalysisException due to missing column 'age'
C. Zero, because the filter syntax is incorrect
D. Number of rows where age is greater than 30
💡 Hint
Filter keeps only rows matching the condition.
🔧 Debug (advanced)
Identify the error in Parquet write code
What error will this Spark code produce when trying to write a DataFrame to Parquet?
Apache Spark
df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], ['id', 'name'])
df.write.parquet('output/data.parquet', mode='overwrite', format='csv')
A. TypeError because 'format' is not a valid argument for write.parquet()
B. Writes the DataFrame successfully in CSV format
C. AnalysisException due to missing output directory
D. SyntaxError due to incorrect method chaining
💡 Hint
Check the method signature for write.parquet().
🚀 Application (expert)
Optimizing query performance with Parquet and column pruning
You have a large Parquet dataset with many columns. You want to speed up a Spark query that only needs 'user_id' and 'purchase_amount'. Which approach will best improve performance?
A. Read the entire Parquet file and then select the 'user_id' and 'purchase_amount' columns
B. Convert the Parquet file to CSV first, then read only the needed columns
C. Use Spark to read only the 'user_id' and 'purchase_amount' columns from the Parquet file directly
D. Load the Parquet data fully into memory before filtering columns
💡 Hint
Think about how columnar storage helps avoid reading unnecessary data.