Recall & Review
beginner
What does partitioning mean when writing output in Apache Spark?
Partitioning means dividing the output data into separate folders or files based on the values of one or more columns. This helps organize data and makes it faster to read specific parts later.
beginner
How do you write a DataFrame in Spark with partitioning by a column named 'year'?
You use the partitionBy method before saving. For example:
df.write.partitionBy('year').parquet('path')
This saves the data in folders named by year.
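Running that call needs a live SparkSession, but the folder layout it produces can be sketched in plain Python. The sample rows, the helper name, and the file name below are hypothetical illustrations, not Spark's actual writer:

```python
import csv
import os
import tempfile

# Hypothetical sample rows standing in for a DataFrame (illustration only).
rows = [
    {"year": "2022", "value": "a"},
    {"year": "2023", "value": "b"},
    {"year": "2023", "value": "c"},
]

def write_partitioned(rows, base_dir, partition_col):
    """Write rows into Hive-style folders like base_dir/year=2023/,
    mimicking the layout df.write.partitionBy('year') produces on disk."""
    groups = {}
    for row in rows:
        groups.setdefault(row[partition_col], []).append(row)
    for value, group in groups.items():
        part_dir = os.path.join(base_dir, f"{partition_col}={value}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-00000.csv"), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["value"])
            writer.writeheader()
            # As in Spark, the partition column is dropped from the file
            # itself: its value lives in the folder name.
            writer.writerows({"value": r["value"]} for r in group)
    return sorted(os.listdir(base_dir))

base = tempfile.mkdtemp()
print(write_partitioned(rows, base, "year"))  # ['year=2022', 'year=2023']
```

Note how each distinct value of the partition column becomes its own `column=value` folder; this naming convention is what lets readers locate a year without opening any files.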
intermediate
Why is partitioning output useful in big data processing?
Partitioning helps by:
1. Organizing data for easy access.
2. Improving query speed by reading only the needed partitions.
3. Managing large datasets efficiently.
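Benefit 2 is often called partition pruning: folders whose names rule them out are never opened. A minimal plain-Python sketch of the idea, using a hypothetical in-memory store in place of real folders:

```python
# Hypothetical in-memory "partition store": folder name -> rows it holds.
store = {
    "year=2021": [{"id": 1}, {"id": 2}],
    "year=2022": [{"id": 3}],
    "year=2023": [{"id": 4}, {"id": 5}],
}

def read(store, year=None):
    """Read rows, skipping folders whose name rules them out --
    the idea behind partition pruning."""
    scanned = []
    rows = []
    for folder, data in store.items():
        if year is not None and folder != f"year={year}":
            continue  # pruned: this folder is never opened
        scanned.append(folder)
        rows.extend(data)
    return scanned, rows

scanned, rows = read(store, year=2023)
print(scanned)  # ['year=2023'] -- only one folder touched
```

With a filter on the partition column, only one of the three folders is read; without it, all three would be scanned.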
intermediate
What happens if you write output with multiple partition columns in Spark?
Spark creates nested folders, one level per partition column. For example, partitioning by 'year' and 'month' creates folders like year=2023/month=06/.
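The nested path for a single row can be sketched as follows; the partition_path helper is a hypothetical illustration, not a Spark API:

```python
def partition_path(base, row, partition_cols):
    """Build the nested Hive-style path Spark uses for one row:
    one folder level per partition column, in the order given."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    # Spark output paths use forward slashes regardless of platform.
    return "/".join([base] + parts)

row = {"year": 2023, "month": "06", "value": 42}
print(partition_path("path", row, ["year", "month"]))
# path/year=2023/month=06
```

The order of the columns matters: `['year', 'month']` nests month inside year, while `['month', 'year']` would do the reverse.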
intermediate
Can you overwrite data when writing with partitioning in Spark? How?
Yes, use mode('overwrite') together with partitionBy. For example:
df.write.mode('overwrite').partitionBy('year').parquet('path')
By default this replaces all existing data under the output path; set spark.sql.sources.partitionOverwriteMode to 'dynamic' to replace only the partitions present in the new data.
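A rough sketch of how overwrite differs from append at the level of a single partition, using a hypothetical in-memory store in place of real files. Note that Spark's default ("static") overwrite clears the whole output path; this sketch models the per-partition ("dynamic") behavior:

```python
# Hypothetical partition store: folder name -> list of rows it holds.
def write(store, partition, new_rows, mode):
    """Sketch of save-mode semantics for one partition."""
    if mode == "overwrite":
        store[partition] = list(new_rows)  # replace partition contents
    elif mode == "append":
        store.setdefault(partition, []).extend(new_rows)  # add alongside
    else:
        raise ValueError(f"unsupported mode: {mode}")
    return store

store = {"year=2023": ["old1", "old2"]}
write(store, "year=2023", ["new1"], mode="overwrite")
print(store["year=2023"])  # ['new1'] -- the old rows are gone
```

Appending instead would have kept `old1` and `old2` and added `new1` beside them.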
What does the partitionBy method do when writing a DataFrame in Spark?
partitionBy organizes output data into folders by column values, making data easier to manage and query.
If you partition output by 'country' and 'year', how will Spark organize the files?
Spark creates nested folders for each partition column, so data is stored in folders by country and year.
Which file format is commonly used with partitioned output in Spark for efficient storage?
Parquet is a columnar format that works well with partitioning for fast queries and compression.
What does mode('overwrite') do when writing partitioned data?
Using mode('overwrite') replaces existing data with the new data; by default the whole output path is replaced, unless dynamic partition overwrite is enabled.
Why might you want to avoid partitioning by too many columns?
Too many partitions create many small files, which can hurt performance and increase overhead.
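The folder count grows multiplicatively with the distinct values of each partition column, so a quick back-of-the-envelope check (with hypothetical cardinalities) shows how a high-cardinality column blows up the layout:

```python
# Back-of-the-envelope: the number of output folders is the product of
# the distinct-value counts of the partition columns (hypothetical figures).
distinct_years = 5
distinct_countries = 50
distinct_user_ids = 100_000

coarse = distinct_years * distinct_countries  # 250 folders: manageable
fine = coarse * distinct_user_ids             # adding user_id: 25,000,000 folders
print(coarse, fine)  # 250 25000000
```

Each of those millions of folders would hold at least one small file, which is why high-cardinality columns like user IDs are usually poor partition keys.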
Explain how partitioning output in Spark helps with data organization and query performance.
Think about how sorting papers into labeled folders helps find them faster.
Describe the steps and code to write a Spark DataFrame partitioned by 'region' and 'month' in Parquet format.
Remember to chain methods in the right order.