What will be the folder structure created when writing a DataFrame partitioned by two columns?
Consider this Spark code:
df.write.partitionBy("year", "month").parquet("/data/events")
The DataFrame has data for years 2022 and 2023, and months 1 and 2.
Which folder paths will be created inside /data/events?
Partition folders are created in the order of the columns specified in partitionBy.
When writing with partitionBy("year", "month"), Spark creates nested folders first by year, then by month inside each year folder.
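The resulting layout can be sketched in plain Python. This is an illustration of Spark's Hive-style key=value folder naming, not actual Spark output; the paths are derived from the year and month values in the question:

```python
from itertools import product

# Spark names partition folders Hive-style (<column>=<value>) and nests
# them in the order the columns were given to partitionBy("year", "month").
years = [2022, 2023]
months = [1, 2]

expected_folders = [
    f"/data/events/year={y}/month={m}" for y, m in product(years, months)
]
for path in expected_folders:
    print(path)
# /data/events/year=2022/month=1
# /data/events/year=2022/month=2
# /data/events/year=2023/month=1
# /data/events/year=2023/month=2
```

Note that year is always the outer folder and month the inner one, because that is the order passed to partitionBy.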
You write a DataFrame using partitionBy("category"). The DataFrame has 2 unique categories: 'A' and 'B'.
If the DataFrame has 6 partitions before writing, how many parquet files will be created inside the output folders?
Each DataFrame partition writes one file into each partition folder for which it holds rows.
Even though there are only 2 categories, the DataFrame has 6 partitions. If each partition holds rows for a single category, each writes exactly one file, so total files = 6; a partition containing rows for both categories would write two files, giving up to 12 in total.
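The counting rule above can be sketched in plain Python. The assignment of categories to partitions is hypothetical; what matters is that each task writes one file per category value actually present in its data:

```python
# Hypothetical case: each of the 6 DataFrame partitions happens to
# contain rows for exactly one category.
partitions = [{"A"}, {"B"}, {"A"}, {"B"}, {"A"}, {"B"}]

# One output file per (task partition, category) pair that is present.
files_written = sum(len(categories) for categories in partitions)
print(files_written)  # 6

# If every partition contained rows for both categories, the count doubles:
mixed = [{"A", "B"}] * 6
print(sum(len(categories) for categories in mixed))  # 12
```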
What error will this Spark code produce?
df.write.partitionBy("nonexistent_column").parquet("/output/path")
The DataFrame df does not have a column named nonexistent_column.
Partition columns must exist in the DataFrame schema.
Spark throws an AnalysisException if the partition column is missing from the DataFrame.
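The validation Spark performs can be mimicked with a plain-Python check. This is an illustrative stand-in, not Spark's implementation; it raises ValueError where Spark would raise AnalysisException:

```python
def check_partition_columns(schema_columns, partition_columns):
    """Mimic Spark's check that every partitionBy column exists in the schema."""
    missing = [c for c in partition_columns if c not in schema_columns]
    if missing:
        # Spark raises AnalysisException here; ValueError is a stand-in.
        raise ValueError(f"Partition column(s) not found in schema: {missing}")

schema = ["year", "month", "event"]      # hypothetical DataFrame schema
check_partition_columns(schema, ["year"])  # valid column: no error
try:
    check_partition_columns(schema, ["nonexistent_column"])
except ValueError as e:
    print(e)
```

The check happens during query analysis, before any data is written, which is why no partial output appears under /output/path.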
You have a large sales dataset with columns: date, region, product, and sales_amount.
You want to write the data partitioned to speed up queries filtering by region and date. Which partitioning strategy is best?
Partition columns should match common filter columns in queries.
Partitioning by both region and date lets Spark prune partitions: queries filtering on these columns read only the matching folders instead of scanning the whole dataset.
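Partition pruning can be illustrated in plain Python: with Hive-style folders, a filter on the partition columns selects only the matching directories. The folder paths below are hypothetical examples, not a real listing:

```python
# Hypothetical folders produced by partitionBy("region", "date").
folders = [
    "/data/sales/region=EU/date=2023-01-01",
    "/data/sales/region=EU/date=2023-01-02",
    "/data/sales/region=US/date=2023-01-01",
    "/data/sales/region=US/date=2023-01-02",
]

# A query like WHERE region = 'EU' AND date = '2023-01-01'
# only needs to read one folder; the rest are skipped entirely.
pruned = [f for f in folders if "region=EU" in f and "date=2023-01-01" in f]
print(pruned)  # ['/data/sales/region=EU/date=2023-01-01']
```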
Which statement best describes the impact of partitioning output data on data skew and shuffle during a Spark write?
Think about how data distribution affects partitioning.
Partitioning can reduce shuffle by grouping related data together, but if the partition keys are unevenly distributed, some partitions grow much larger than others, causing skew.
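Skew can be made visible by comparing partition sizes. A short sketch with hypothetical row counts per key, where one key dominates the data:

```python
# Hypothetical row counts per partition key: "US" dominates.
rows_per_key = {"US": 9_000_000, "EU": 500_000, "APAC": 500_000}

total = sum(rows_per_key.values())
largest = max(rows_per_key.values())
average = total / len(rows_per_key)

# A ratio well above 1 means one partition does most of the write work
# while the others sit idle.
skew_ratio = largest / average
print(f"skew ratio: {skew_ratio:.1f}")  # skew ratio: 2.7
```

In a real job, a similar diagnosis comes from df.groupBy(partition_column).count() before writing.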