Apache Spark · ~10 mins

Local mode vs cluster mode in Apache Spark - Visual Side-by-Side Comparison

Concept Flow - Local mode vs cluster mode
Start Spark Application → Choose Mode → Local Mode → Run on Local (Single JVM) → Collect Results → End
Spark starts and chooses either local mode or cluster mode. Local mode runs everything on one machine. Cluster mode splits work across many machines, then collects results.
Execution Sample
Apache Spark
from pyspark.sql import SparkSession

# 'local[*]' runs Spark in this single JVM, using all local CPU cores
spark = SparkSession.builder.master('local[*]').appName('LocalApp').getOrCreate()
data = spark.range(5).collect()  # numbers 0-4 gathered back to the driver
spark.stop()
This code runs a Spark job in local mode, creating a small dataset and collecting it back.
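Spark interprets the master string itself, but the rule for local mode is simple enough to sketch in plain Python. The helper below is hypothetical (Spark does this parsing internally, not via any public API); it illustrates that 'local' means one worker thread, 'local[N]' means N threads, and 'local[*]' means one per CPU core.

```python
import os

def parse_local_master(master: str) -> int:
    """Illustrative helper: worker-thread count implied by a Spark local master string."""
    if master == "local":
        return 1  # bare 'local' means a single worker thread
    if master.startswith("local[") and master.endswith("]"):
        n = master[len("local["):-1]
        # '*' means one thread per available CPU core
        return os.cpu_count() if n == "*" else int(n)
    raise ValueError(f"not a local master URL: {master}")

print(parse_local_master("local[1]"))  # 1
print(parse_local_master("local[*]"))  # number of CPU cores on this machine
```

This is why switching 'local[*]' to 'local[1]' makes the same job run on a single core.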
Execution Table
Step | Action | Mode | Details | Result
1 | Start SparkSession | Local | master='local[*]' | Spark runs locally on all CPU cores
2 | Create Dataset | Local | spark.range(5) | Dataset with numbers 0 to 4 created
3 | Collect Data | Local | collect() | Data collected to driver as list [0, 1, 2, 3, 4]
4 | Stop Spark | Local | spark.stop() | Spark session ends
5 | Start SparkSession | Cluster | master='spark://cluster-master:7077' | Spark connects to cluster manager
6 | Create Dataset | Cluster | spark.range(5) | Dataset created and split across cluster nodes
7 | Distribute Tasks | Cluster | Tasks sent to worker nodes | Parallel processing on multiple machines
8 | Collect Data | Cluster | collect() | Results gathered back to driver node
9 | Stop Spark | Cluster | spark.stop() | Spark session ends
10 | Exit | - | - | All jobs finished
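Steps 6-8 of the cluster branch — split a dataset, process the pieces in parallel, gather the results — can be mimicked in plain Python. The sketch below is only an analogy: threads stand in for worker nodes, and `split_into_partitions`/`run_task` are illustrative helpers, not Spark APIs; Spark's real scheduler is far more involved.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, n):
    """Divide data into n roughly equal partitions, loosely like Spark splitting a dataset."""
    return [data[i::n] for i in range(n)]

def run_task(partition):
    """Each 'worker' processes its partition; here the job is just the identity."""
    return partition

# Create, distribute, collect - mirroring steps 6-8 of the table.
data = list(range(5))
partitions = split_into_partitions(data, 3)
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_task, partitions))
collected = sorted(x for part in results for x in part)
print(collected)  # [0, 1, 2, 3, 4]
```

As in Spark, the "collect" step is where distributed pieces come back to one place, which is why `collect()` on a huge dataset can overwhelm the driver.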
💡 Execution stops after Spark session is stopped in both modes.
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 6 | After Step 8 | Final
spark | None | SparkSession(local) | SparkSession(local) | SparkSession(cluster) | SparkSession(cluster) | None
data | None | Dataset(0-4) | [0, 1, 2, 3, 4] | Dataset(0-4) | [0, 1, 2, 3, 4] | None
Key Moments - 2 Insights
Why does local mode use 'local[*]' as master and cluster mode use a URL?
Local mode uses 'local[*]' to run Spark on all CPU cores of one machine (see Execution Table, step 1). Cluster mode uses a URL like 'spark://...' to connect to a cluster manager that controls many machines (see step 5).
How does data processing differ between local and cluster modes?
In local mode, all processing happens inside one JVM on one machine (step 3). In cluster mode, tasks are split and run on multiple worker nodes in parallel (step 7), then results are collected back (step 8).
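For comparison with the local-mode code above, a cluster-mode session differs only in the master URL. The sketch below is a configuration fragment, not runnable as-is: 'spark://cluster-master:7077' is a placeholder for a real Spark standalone cluster manager address (7077 is its conventional port).

```python
from pyspark.sql import SparkSession

# Placeholder URL - replace with your cluster manager's actual address.
spark = (SparkSession.builder
         .master('spark://cluster-master:7077')
         .appName('ClusterApp')
         .getOrCreate())
data = spark.range(5).collect()  # partitions processed on workers, rows returned to the driver
spark.stop()
```

The application code is identical in both modes; only the master setting decides where the tasks run.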
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table, what is the result after step 3 in local mode?
A. Data collected as list [0, 1, 2, 3, 4]
B. Dataset created but not collected
C. Spark session stopped
D. Tasks distributed to cluster nodes
💡 Hint
Check the 'Result' column for step 3 in the Execution Table.
At which step does Spark distribute tasks to multiple machines in cluster mode?
A. Step 8
B. Step 6
C. Step 7
D. Step 9
💡 Hint
Look for the 'Distribute Tasks' action in cluster mode in the Execution Table.
If you change master from 'local[*]' to 'local[1]', how does the execution change?
A. Spark runs on multiple cluster nodes
B. Spark runs on only one CPU core instead of all cores
C. Spark cannot start
D. Spark runs on GPU
💡 Hint
Refer to the meaning of 'local[*]' vs 'local[1]' in the Concept Snapshot.
Concept Snapshot
Spark Modes Quick Reference:
- Local mode: master='local[*]' runs Spark on all CPU cores of one machine.
- Cluster mode: master='spark://...' connects to a cluster manager for distributed computing.
- Local mode runs in a single JVM; cluster mode runs tasks on many machines.
- Use local mode for testing and small jobs; cluster mode for big data processing.
- collect() gathers distributed data back to the driver program.
Full Transcript
This visual execution compares Spark's local mode and cluster mode. Spark starts by choosing a mode. In local mode, Spark runs everything on one machine using all CPU cores. The example code creates a small dataset of numbers 0 to 4 and collects it back to the driver. The execution table shows steps like starting SparkSession with 'local[*]', creating data, collecting it, and stopping Spark. In cluster mode, Spark connects to a cluster manager URL, splits the dataset across many machines, distributes tasks to worker nodes, collects results, and stops. Variables like 'spark' and 'data' change accordingly. Key moments clarify why local mode uses 'local[*]' and how cluster mode distributes work. The quiz tests understanding of execution steps and mode differences. The snapshot summarizes the main points for quick recall.