Apache Spark · data · ~30 mins

Understanding partitions in Apache Spark - Hands-On Activity

📖 Scenario: You are working with a large dataset of sales records in a Spark environment. To improve performance, you want to understand how Spark divides data into partitions.
🎯 Goal: Learn how to check the number of partitions in a Spark DataFrame and how to change it.
📋 What You'll Learn
Create a Spark DataFrame with sample sales data
Create a variable to hold the desired number of partitions
Repartition the DataFrame using the variable
Print the number of partitions before and after repartitioning
💡 Why This Matters
🌍 Real World
Data scientists and engineers often need to manage how data is split across machines to optimize processing speed and resource use.
💼 Career
Understanding partitions is key for working efficiently with big data tools like Apache Spark in roles such as data engineer, data scientist, and big data analyst.
1
Create a Spark DataFrame with sales data
Create a Spark DataFrame called sales_df with these exact rows: (1, '2023-01-01', 100), (2, '2023-01-02', 150), (3, '2023-01-03', 200). Use columns named 'id', 'date', and 'amount'.
Need a hint?

Use spark.createDataFrame() with a list of tuples and a list of column names.

2
Set the desired number of partitions
Create a variable called num_partitions and set it to 2.
Need a hint?

Just create a variable named num_partitions and assign the number 2.
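This step is a single assignment; keeping the target count in a variable makes it easy to tune later:

```python
# Target number of partitions for the repartition step.
num_partitions = 2
```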

3
Repartition the DataFrame
Create a new DataFrame called repartitioned_df by repartitioning sales_df using num_partitions.
Need a hint?

Use the repartition() method on sales_df with num_partitions.

4
Print the number of partitions before and after repartitioning
Print the number of partitions in sales_df using rdd.getNumPartitions(). Then print the number of partitions in repartitioned_df the same way.
Need a hint?

Use print(sales_df.rdd.getNumPartitions()) and print(repartitioned_df.rdd.getNumPartitions()).