Apache Spark · data · ~30 mins

Writing output with partitioning in Apache Spark - Mini Project: Build & Apply

Writing output with partitioning
📖 Scenario: You work at a retail company. You have sales data for different stores and dates, and you want to save it so that sales for any single store can be found quickly.
🎯 Goal: Create a Spark DataFrame with sales data, set a partition column, write the data partitioned by store, and show the output path structure.
📋 What You'll Learn
Create a Spark DataFrame with columns: store, date, sales
Create a variable called partition_column with value 'store'
Write the DataFrame to disk partitioned by the partition_column
Print the list of partition folders created
💡 Why This Matters
🌍 Real World
Partitioning data by a column like store helps organize large datasets so queries can run faster by reading only needed partitions.
💼 Career
Data engineers and data scientists often write partitioned data to improve performance and manageability in big data systems.
1
Create the sales DataFrame
Create a Spark DataFrame called sales_df with these rows exactly: ('StoreA', '2024-01-01', 100), ('StoreB', '2024-01-01', 150), ('StoreA', '2024-01-02', 200). The columns must be store, date, and sales.
Apache Spark
Need a hint?

Use spark.createDataFrame() with a list of tuples and a list of column names.

2
Set the partition column
Create a variable called partition_column and set it to the string 'store'.
Apache Spark
Need a hint?

Just assign the string 'store' to the variable partition_column.
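This step is a plain assignment; keeping the column name in a variable means later steps don't hard-code it:

```python
# The column to partition by; must match a column name in sales_df.
partition_column = "store"
```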

3
Write the DataFrame partitioned by store
Write the sales_df DataFrame to disk in Parquet format, partitioned by the column stored in partition_column. Use the path 'output/sales_data'. Use mode('overwrite') to replace existing data.
Apache Spark
Need a hint?

Use sales_df.write.mode('overwrite').partitionBy(partition_column).parquet('output/sales_data').

4
List the partition folders
Import os. Use os.listdir to list the folders inside 'output/sales_data'. Print the list of folder names.
Apache Spark
Need a hint?

Use os.listdir('output/sales_data') and print the result.
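A sketch of the listing step. Spark names each partition folder `column=value`, and also drops bookkeeping files like `_SUCCESS` into the output directory, so filtering on the `store=` prefix keeps only the partition folders (the existence check is just a guard so the snippet runs even before the write step):

```python
import os

out_path = "output/sales_data"
if os.path.isdir(out_path):
    # Keep only the partition folders, e.g. store=StoreA, store=StoreB.
    partition_folders = [
        name for name in os.listdir(out_path) if name.startswith("store=")
    ]
    print(sorted(partition_folders))
else:
    print(f"{out_path} not found - run the write step first")
```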