Apache Spark · ~10 mins

Understanding partitions in Apache Spark - Visual Explanation

Concept Flow - Understanding partitions
Start with RDD/DataFrame
Check number of partitions
Perform transformations
Shuffle or narrow dependencies?
Repartition or coalesce
Action triggers execution
Tasks run on each partition
Collect or save results
End
This flow shows how Spark handles partitions, from creating the data and checking partition counts, through transformations, to execution and results.
Execution Sample
Apache Spark
rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 3)  # create an RDD with 3 partitions
rdd2 = rdd.map(lambda x: x * 2)              # narrow transformation, still lazy
print(rdd2.getNumPartitions())               # prints 3
result = rdd2.collect()                      # action: triggers execution
Create an RDD with 3 partitions, double each element, check partitions, then collect results.
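The per-partition behavior in the sample above can be sketched in plain Python, without a running Spark cluster. This is only an illustration of the idea, not Spark's implementation; the helper names `split_into_partitions` and `map_partitions` are made up for this sketch.

```python
# Plain-Python sketch (not Spark) of how parallelize([1,2,3,4,5,6], 3)
# splits data into partitions and how map runs partition by partition.
# Helper names are illustrative, not part of any Spark API.

def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions equal-sized chunks."""
    size = len(data) // num_partitions
    return [data[i * size:(i + 1) * size] for i in range(num_partitions)]

def map_partitions(partitions, fn):
    """Apply fn to every element, one partition at a time (like rdd.map)."""
    return [[fn(x) for x in part] for part in partitions]

partitions = split_into_partitions([1, 2, 3, 4, 5, 6], 3)
print(partitions)        # [[1, 2], [3, 4], [5, 6]]

doubled = map_partitions(partitions, lambda x: x * 2)
print(doubled)           # [[2, 4], [6, 8], [10, 12]]
print(len(doubled))      # 3 -- map does not change the partition count

# "collect" flattens all partitions back into a single list
result = [x for part in doubled for x in part]
print(result)            # [2, 4, 6, 8, 10, 12]
```

Note how the doubled values stay in the same three chunks: that is exactly what "narrow transformation" means in the execution table.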
Execution Table
Step | Action | Partitions | Data in Partitions | Output/Result
1 | Create RDD with 3 partitions | 3 | [1,2] [3,4] [5,6] | RDD created
2 | Apply map to double values | 3 | [2,4] [6,8] [10,12] | Transformation defined (lazy)
3 | Check number of partitions | 3 | N/A | 3 partitions
4 | Collect triggers execution | 3 | [2,4] [6,8] [10,12] | Data collected as [2,4,6,8,10,12]
💡 Execution stops after collect gathers all data from partitions.
Variable Tracker
Variable | Start | After Step 1 | After Step 2 | After Step 4
rdd | None | [1,2] [3,4] [5,6] | [1,2] [3,4] [5,6] | [1,2] [3,4] [5,6]
rdd2 | None | None | [2,4] [6,8] [10,12] | [2,4] [6,8] [10,12]
result | None | None | None | [2,4,6,8,10,12]
Key Moments - 3 Insights
Why does the number of partitions stay the same after map transformation?
Because map is a narrow transformation: each output partition depends on exactly one input partition, so the partition count is unchanged, as shown in steps 2 and 3 of the execution table.
When does Spark actually process the data in partitions?
Spark processes data only when an action like collect is called, as seen in step 4 where execution happens.
What happens if we want to change the number of partitions?
We use repartition (a full shuffle that can increase or decrease the count) or coalesce (which avoids a full shuffle and is typically used to decrease it) before actions. Neither is shown here, but either would change the partition count.
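The difference between the two can be sketched in plain Python. This is a simplified model, not Spark's actual data movement across executors; the function names mirror the RDD methods but the bodies are illustrative.

```python
# Plain-Python sketch (not Spark) of coalesce vs. repartition.
# Real Spark moves data between executors; this only models partition layout.

def coalesce(partitions, n):
    """Merge existing partitions down to n without an element-level shuffle."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)  # whole partitions are folded together
    return merged

def repartition(partitions, n):
    """Full shuffle: redistribute every element round-robin into n partitions."""
    flat = [x for part in partitions for x in part]
    out = [[] for _ in range(n)]
    for i, x in enumerate(flat):
        out[i % n].append(x)
    return out

parts = [[2, 4], [6, 8], [10, 12]]
print(coalesce(parts, 2))     # 2 partitions; whole chunks merged
print(repartition(parts, 6))  # 6 partitions; every element moved
```

The key contrast: coalesce keeps existing chunks intact and only merges them, while repartition scatters every individual element, which is why it is more expensive.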
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, how many partitions does rdd2 have after the map transformation?
A. 3
B. 1
C. 6
D. 0
💡 Hint
Check the 'Partitions' column at steps 2 and 3 in the execution table.
At which step does Spark actually compute the doubled values?
A. Step 2
B. Step 4
C. Step 3
D. Step 1
💡 Hint
Look for when 'Data collected' happens in the 'Output/Result' column.
If we changed the initial parallelize call to 2 partitions, what would change in the execution table?
A. Result would be empty
B. Data in partitions would remain the same
C. Partitions column would show 2 instead of 3
D. Number of partitions would increase
💡 Hint
Refer to the 'Partitions' column in steps 1 and 3 of the execution table.
Concept Snapshot
Spark partitions split data for parallel work.
Transformations like map keep the partition count the same.
Actions like collect trigger execution.
Repartition changes partition count.
Check partitions with getNumPartitions().
Full Transcript
This lesson shows how Spark handles partitions. We start by creating an RDD with 3 partitions. Each partition holds part of the data. When we apply a map transformation to double values, the number of partitions stays the same because map is a narrow transformation. Spark does not process data immediately; it waits until an action like collect is called. At collect, Spark runs tasks on each partition and gathers results. You can check the number of partitions anytime with getNumPartitions. To change partitions, use repartition or coalesce before actions.
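The lazy evaluation described in the transcript can be sketched in plain Python: transformations are only recorded, and nothing runs until an action. The class name `LazyRDD` and its internals are illustrative, not Spark's implementation.

```python
# Plain-Python sketch (not Spark) of lazy evaluation:
# map only records the function; collect actually runs it per partition.

class LazyRDD:
    def __init__(self, partitions, pending=None):
        self.partitions = partitions
        self.pending = pending or []   # recorded transformations, not yet run

    def map(self, fn):
        # Lazy: remember fn; no data is touched yet.
        return LazyRDD(self.partitions, self.pending + [fn])

    def getNumPartitions(self):
        return len(self.partitions)

    def collect(self):
        # Action: now apply every recorded transformation, partition by partition.
        result = []
        for part in self.partitions:
            for x in part:
                for fn in self.pending:
                    x = fn(x)
                result.append(x)
        return result

rdd = LazyRDD([[1, 2], [3, 4], [5, 6]])
rdd2 = rdd.map(lambda x: x * 2)     # nothing computed yet
print(rdd2.getNumPartitions())      # 3
print(rdd2.collect())               # [2, 4, 6, 8, 10, 12]
```

Until collect is called, rdd2 holds only the data layout and a list of pending functions, which mirrors why step 2 in the execution table is marked "Transformation defined (lazy)".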