
Spark architecture (driver, executors, cluster manager) in Apache Spark - Step-by-Step Execution

Concept Flow - Spark architecture (driver, executors, cluster manager)
User submits Spark job
Driver program starts
Driver requests resources from Cluster Manager
Cluster Manager allocates Executors
Executors run tasks
Executors send results back to Driver
Driver collects results and completes job
This flow shows how a Spark job moves from user submission through the driver, cluster manager, executors, and back to the driver for results.
Execution Sample
PySpark
# PySpark job: read a CSV, filter rows, count the matches
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Example').getOrCreate()
data = spark.read.csv('data.csv')
result = data.filter("cast(_c1 as int) > 10").count()
print(result)
This code reads data.csv into a DataFrame, filters the rows where the second column (cast to an integer) is greater than 10, counts the matching rows, and prints the result.
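Stripped of the distributed machinery, the filter-and-count is equivalent to this single-machine Python sketch (the sample rows below are hypothetical stand-ins for data.csv):

```python
# Hypothetical rows standing in for data.csv: (_c0, _c1) string pairs
rows = [("a", "5"), ("b", "12"), ("c", "30"), ("d", "8")]

# Filter rows where the second column, cast to int, exceeds 10, then count
result = sum(1 for _, c1 in rows if int(c1) > 10)
print(result)  # prints 2: only the "12" and "30" rows pass the filter
```

Spark performs this same logical computation, but spread across executors working on separate partitions of the data.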
Execution Table
Step | Component | Action | Details | Output
1 | User | Submits job | Job with filter and count | Job sent to Driver
2 | Driver | Starts | Initializes SparkContext and DAG | Ready to schedule tasks
3 | Driver | Requests resources | Asks Cluster Manager for executors | Resource request sent
4 | Cluster Manager | Allocates executors | Assigns executors on worker nodes | Executors launched
5 | Executors | Run tasks | Filter and count tasks executed on data partitions | Partial counts computed
6 | Executors | Send results | Partial counts sent back to Driver | Partial results received
7 | Driver | Aggregates results | Sums partial counts | Final count computed
8 | Driver | Job complete | Prints final count | Output displayed to user
9 | - | Exit | Job finished successfully | -
💡 Job completes after driver aggregates results and prints output
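Steps 5 through 7 can be sketched in plain Python: each executor computes a partial count over its own partition, and the driver sums the partial counts into the final result (the partition contents below are hypothetical):

```python
# Hypothetical partitions of the _c1 column, one list per executor
partitions = [[5, 12, 30], [8, 41], [2, 19]]

# Step 5: each executor filters its partition and counts the matches
partial_counts = [sum(1 for v in part if v > 10) for part in partitions]

# Steps 6-7: partial counts travel back to the driver, which sums them
final_count = sum(partial_counts)
print(partial_counts, final_count)  # prints [2, 1, 1] 4
```

The key point this illustrates: no single executor ever sees the whole dataset; only the small partial counts cross the network back to the driver.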
Variable Tracker
Variable | Start | After Step 5 | After Step 7 | Final
spark | None | SparkSession active | SparkSession active | SparkSession active
data | None | Data loaded as DataFrame | DataFrame unchanged | DataFrame unchanged
partial_counts | None | List of counts from executors | Aggregated sum | Aggregated sum
final_count | None | None | Sum of partial counts | Printed output
Key Moments - 3 Insights
Why does the driver request resources from the cluster manager before running tasks?
The driver needs executors to run tasks. It asks the cluster manager to allocate these executors on worker nodes, as shown in steps 3 and 4 of the execution table.
What is the role of executors in Spark architecture?
Executors run the actual tasks on data partitions and send results back to the driver. This is shown in steps 5 and 6 where executors process data and return partial counts.
Why does the driver aggregate results after executors finish tasks?
Executors compute partial results on data partitions. The driver collects these partial results and combines them to get the final output, as seen in step 7.
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, at which step does the cluster manager allocate executors?
A. Step 2
B. Step 4
C. Step 6
D. Step 8
💡 Hint
Check the 'Component' and 'Action' columns in the execution table for cluster manager activities.
According to the variable tracker, what is the state of 'partial_counts' after step 5?
A. List of counts from executors
B. None
C. Aggregated sum
D. Printed output
💡 Hint
Look at the 'partial_counts' row and the 'After Step 5' column in the variable tracker.
If the driver did not aggregate results, what would be missing in the execution table?
A. Executors running tasks
B. Cluster manager allocating executors
C. Final count computed
D. User submitting job
💡 Hint
Refer to step 7 in the execution table where the driver aggregates results.
Concept Snapshot
Spark architecture overview:
- Driver: coordinates job, creates tasks
- Cluster Manager: allocates executors
- Executors: run tasks on data partitions
- Data flows from user to driver, then executors, back to driver
- Driver aggregates results and completes job
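In code, the choice of cluster manager surfaces in the SparkSession's master setting. A configuration sketch (the master URLs are placeholders; substitute your own cluster's address):

```python
from pyspark.sql import SparkSession

# master() selects the cluster manager:
#   'local[*]'          - driver and executors in one process (no cluster manager)
#   'yarn'              - Hadoop YARN allocates the executors
#   'spark://host:7077' - Spark's standalone cluster manager
spark = (
    SparkSession.builder
    .appName('Example')
    .master('local[*]')  # placeholder; point this at your cluster manager
    .config('spark.executor.instances', '2')  # executor count (honored on YARN/Kubernetes)
    .getOrCreate()
)
```

With 'local[*]' the whole architecture collapses into one process, which is why local runs behave the same logically but skip steps 3 and 4 of the execution table.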
Full Transcript
In Spark architecture, the user submits a job that starts the driver program. The driver initializes and requests resources from the cluster manager. The cluster manager allocates executors on worker nodes. Executors run tasks on data partitions and send partial results back to the driver. The driver aggregates these results to produce the final output and completes the job. Variables like the SparkSession, data, partial counts, and final count change state through these steps. Key points include the driver's role in resource requests and result aggregation, and executors' role in task execution.