Apache Spark · ~10 mins

Local mode vs cluster mode in Apache Spark - Visual Side-by-Side Comparison

Concept Flow - Local mode vs cluster mode
Start Spark Application → Choose Mode → Local Mode → Run on Local (Single JVM) → Collect Results → End
Spark starts and chooses either local mode or cluster mode. Local mode runs everything on one machine. Cluster mode splits work across many machines, then collects results.
Execution Sample
Apache Spark
from pyspark.sql import SparkSession

# 'local[*]' runs Spark in this single JVM, using all local CPU cores
spark = SparkSession.builder.master('local[*]').appName('LocalApp').getOrCreate()
data = spark.range(5).collect()  # numbers 0-4 gathered back to the driver
spark.stop()
This code runs a Spark job in local mode, creating a small dataset and collecting it back.
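Spark interprets the master string itself, but the rule for local mode is simple enough to sketch in plain Python. The helper below is hypothetical (Spark does this parsing internally, not via any public API); it illustrates that 'local' means one worker thread, 'local[N]' means N threads, and 'local[*]' means one per CPU core.

```python
import os

def parse_local_master(master: str) -> int:
    """Illustrative helper: worker-thread count implied by a Spark local master string."""
    if master == "local":
        return 1  # bare 'local' means a single worker thread
    if master.startswith("local[") and master.endswith("]"):
        n = master[len("local["):-1]
        # '*' means one thread per available CPU core
        return os.cpu_count() if n == "*" else int(n)
    raise ValueError(f"not a local master URL: {master}")

print(parse_local_master("local[1]"))  # 1
print(parse_local_master("local[*]"))  # number of CPU cores on this machine
```

This is why switching 'local[*]' to 'local[1]' makes the same job run on a single core.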
Execution Table
Step | Action | Mode | Details | Result
1 | Start SparkSession | Local | master='local[*]' | Spark runs locally on all CPU cores
2 | Create Dataset | Local | spark.range(5) | Dataset with numbers 0 to 4 created
3 | Collect Data | Local | collect() | Data collected to driver as list [0, 1, 2, 3, 4]
4 | Stop Spark | Local | spark.stop() | Spark session ends
5 | Start SparkSession | Cluster | master='spark://cluster-master:7077' | Spark connects to cluster manager
6 | Create Dataset | Cluster | spark.range(5) | Dataset created and split across cluster nodes
7 | Distribute Tasks | Cluster | Tasks sent to worker nodes | Parallel processing on multiple machines
8 | Collect Data | Cluster | collect() | Results gathered back to driver node
9 | Stop Spark | Cluster | spark.stop() | Spark session ends
10 | Exit | - | - | All jobs finished
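Steps 6-8 of the cluster branch — split a dataset, process the pieces in parallel, gather the results — can be mimicked in plain Python. The sketch below is only an analogy: threads stand in for worker nodes, and `split_into_partitions`/`run_task` are illustrative helpers, not Spark APIs; Spark's real scheduler is far more involved.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, n):
    """Divide data into n roughly equal partitions, loosely like Spark splitting a dataset."""
    return [data[i::n] for i in range(n)]

def run_task(partition):
    """Each 'worker' processes its partition; here the job is just the identity."""
    return partition

# Create, distribute, collect - mirroring steps 6-8 of the table.
data = list(range(5))
partitions = split_into_partitions(data, 3)
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_task, partitions))
collected = sorted(x for part in results for x in part)
print(collected)  # [0, 1, 2, 3, 4]
```

As in Spark, the "collect" step is where distributed pieces come back to one place, which is why `collect()` on a huge dataset can overwhelm the driver.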
💡 Execution stops after Spark session is stopped in both modes.
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 6 | After Step 8 | Final
spark | None | SparkSession(local) | SparkSession(local) | SparkSession(cluster) | SparkSession(cluster) | None
data | None | Dataset(0-4) | [0, 1, 2, 3, 4] | Dataset(0-4) | [0, 1, 2, 3, 4] | None
Key Moments - 2 Insights
Why does local mode use 'local[*]' as master and cluster mode use a URL?
Local mode uses 'local[*]' to run Spark on all CPU cores of one machine (see Execution Table, step 1). Cluster mode uses a URL like 'spark://...' to connect to a cluster manager that controls many machines (see step 5).
How does data processing differ between local and cluster modes?
In local mode, all processing happens inside one JVM on one machine (step 3). In cluster mode, tasks are split and run on multiple worker nodes in parallel (step 7), then results are collected back (step 8).
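For comparison with the local-mode code above, a cluster-mode session differs only in the master URL. The sketch below is a configuration fragment, not runnable as-is: 'spark://cluster-master:7077' is a placeholder for a real Spark standalone cluster manager address (7077 is its conventional port).

```python
from pyspark.sql import SparkSession

# Placeholder URL - replace with your cluster manager's actual address.
spark = (SparkSession.builder
         .master('spark://cluster-master:7077')
         .appName('ClusterApp')
         .getOrCreate())
data = spark.range(5).collect()  # partitions processed on workers, rows returned to the driver
spark.stop()
```

The application code is identical in both modes; only the master setting decides where the tasks run.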
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table, what is the result after step 3 in local mode?
A. Data collected as list [0, 1, 2, 3, 4]
B. Dataset created but not collected
C. Spark session stopped
D. Tasks distributed to cluster nodes
💡 Hint
Check the 'Result' column for step 3 in the Execution Table.
At which step does Spark distribute tasks to multiple machines in cluster mode?
A. Step 8
B. Step 6
C. Step 7
D. Step 9
💡 Hint
Look for the 'Distribute Tasks' action in cluster mode in the Execution Table.
If you change master from 'local[*]' to 'local[1]', how does the execution change?
A. Spark runs on multiple cluster nodes
B. Spark runs on only one CPU core instead of all cores
C. Spark cannot start
D. Spark runs on GPU
💡 Hint
Refer to the meaning of 'local[*]' vs 'local[1]' in the Concept Snapshot.
Concept Snapshot
Spark Modes Quick Reference:
- Local mode: master='local[*]' runs Spark on all CPU cores of one machine.
- Cluster mode: master='spark://...' connects to a cluster manager for distributed computing.
- Local mode runs in a single JVM; cluster mode runs tasks on many machines.
- Use local mode for testing and small jobs; cluster mode for big data processing.
- collect() gathers distributed data back to the driver program.
Full Transcript
This visual execution compares Spark's local mode and cluster mode. Spark starts by choosing a mode. In local mode, Spark runs everything on one machine using all CPU cores. The example code creates a small dataset of numbers 0 to 4 and collects it back to the driver. The execution table shows steps like starting SparkSession with 'local[*]', creating data, collecting it, and stopping Spark. In cluster mode, Spark connects to a cluster manager URL, splits the dataset across many machines, distributes tasks to worker nodes, collects results, and stops. Variables like 'spark' and 'data' change accordingly. Key moments clarify why local mode uses 'local[*]' and how cluster mode distributes work. The quiz tests understanding of execution steps and mode differences. The snapshot summarizes the main points for quick recall.