Apache Spark · data · ~30 mins

Understanding partitions in Apache Spark - Hands-On Activity

📖 Scenario: You are working with a large dataset of sales records in a Spark environment. To improve performance, you want to understand how Spark divides data into partitions.
🎯 Goal: Learn how to check the number of partitions in a Spark DataFrame and how to change it.
📋 What You'll Learn
Create a Spark DataFrame with sample sales data
Create a variable to hold the desired number of partitions
Repartition the DataFrame using the variable
Print the number of partitions before and after repartitioning
💡 Why This Matters
🌍 Real World
Data scientists and engineers often need to manage how data is split across machines to optimize processing speed and resource use.
💼 Career
Understanding partitions is key for working efficiently with big data tools like Apache Spark in roles such as data engineer, data scientist, and big data analyst.
1
Create a Spark DataFrame with sales data
Create a Spark DataFrame called sales_df with these exact rows: (1, '2023-01-01', 100), (2, '2023-01-02', 150), (3, '2023-01-03', 200). Use columns named 'id', 'date', and 'amount'.
Need a hint?

Use spark.createDataFrame() with a list of tuples and a list of column names.

2
Set the desired number of partitions
Create a variable called num_partitions and set it to 2.
Need a hint?

Just create a variable named num_partitions and assign the number 2.
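This step is a single assignment; keeping the target count in a variable makes it easy to tune later:

```python
# Target number of partitions for the repartition step.
num_partitions = 2
```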

3
Repartition the DataFrame
Create a new DataFrame called repartitioned_df by repartitioning sales_df using num_partitions.
Need a hint?

Use the repartition() method on sales_df with num_partitions.

4
Print the number of partitions before and after repartitioning
Print the number of partitions in sales_df using rdd.getNumPartitions(). Then print the number of partitions in repartitioned_df the same way.
Need a hint?

Use print(sales_df.rdd.getNumPartitions()) and print(repartitioned_df.rdd.getNumPartitions()).