Apache Spark · data · ~30 mins

Writing output with partitioning in Apache Spark - Mini Project: Build & Apply

Writing output with partitioning
📖 Scenario: You work at a retail company. You have sales data for different stores and dates, and you want to save it so that sales for any single store can be found quickly.
🎯 Goal: Create a Spark DataFrame with sales data, set a partition column, write the data partitioned by store, and show the output path structure.
📋 What You'll Learn
Create a Spark DataFrame with columns: store, date, sales
Create a variable called partition_column with value 'store'
Write the DataFrame to disk partitioned by the partition_column
Print the list of partition folders created
💡 Why This Matters
🌍 Real World
Partitioning data by a column like store helps organize large datasets so queries can run faster by reading only needed partitions.
💼 Career
Data engineers and data scientists often write partitioned data to improve performance and manageability in big data systems.
1
Create the sales DataFrame
Create a Spark DataFrame called sales_df with these rows exactly: ('StoreA', '2024-01-01', 100), ('StoreB', '2024-01-01', 150), ('StoreA', '2024-01-02', 200). The columns must be store, date, and sales.
Apache Spark
Need a hint?

Use spark.createDataFrame() with a list of tuples and a list of column names.

2
Set the partition column
Create a variable called partition_column and set it to the string 'store'.
Apache Spark
Need a hint?

Just assign the string 'store' to the variable partition_column.
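This step is a plain assignment; keeping the column name in a variable means later steps don't hard-code it:

```python
# The column to partition by; must match a column name in sales_df.
partition_column = "store"
```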

3
Write the DataFrame partitioned by store
Write the sales_df DataFrame to disk in Parquet format, partitioned by the column stored in partition_column. Use the path 'output/sales_data'. Use mode('overwrite') to replace existing data.
Apache Spark
Need a hint?

Use sales_df.write.mode('overwrite').partitionBy(partition_column).parquet('output/sales_data').

4
List the partition folders
Import os. Use os.listdir to list the folders inside 'output/sales_data'. Print the list of folder names.
Apache Spark
Need a hint?

Use os.listdir('output/sales_data') and print the result.
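A sketch of the listing step. Spark names each partition folder `column=value`, and also drops bookkeeping files like `_SUCCESS` into the output directory, so filtering on the `store=` prefix keeps only the partition folders (the existence check is just a guard so the snippet runs even before the write step):

```python
import os

out_path = "output/sales_data"
if os.path.isdir(out_path):
    # Keep only the partition folders, e.g. store=StoreA, store=StoreB.
    partition_folders = [
        name for name in os.listdir(out_path) if name.startswith("store=")
    ]
    print(sorted(partition_folders))
else:
    print(f"{out_path} not found - run the write step first")
```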