
Local mode vs cluster mode in Apache Spark

Introduction

Spark runs in different modes depending on the amount of data and the resources available. Local mode runs everything on a single machine, while cluster mode distributes work across many machines working together.

Local mode is a good fit for:

Testing small datasets or code on your own laptop.
Learning Spark without needing a big setup.
Running quick jobs that don't need much compute power.

Cluster mode is needed for:

Processing large datasets that require many machines.
Sharing work across a team using a shared cluster.
Syntax
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("MyApp").getOrCreate()

# For cluster mode, master URL changes, e.g.,
spark = SparkSession.builder.master("spark://master-url:7077").appName("MyApp").getOrCreate()

local[*] means use all available CPU cores on your machine; local[N] would use N cores.

Cluster mode needs the address of the cluster's master node (for example, spark://master-url:7077) to connect to.
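Because only the master URL changes between modes, it can be made configurable so the same script runs either way. Here is a minimal sketch; the SPARK_MASTER_URL environment variable and the choose_master helper are assumptions for illustration, not a Spark convention:

```python
import os

def choose_master(default="local[*]"):
    # Hypothetical helper: read the master URL from an environment
    # variable, falling back to local mode for development.
    return os.environ.get("SPARK_MASTER_URL", default)

master = choose_master()
# spark = SparkSession.builder.master(master).appName("MyApp").getOrCreate()
```

On a laptop this falls back to "local[*]"; on a cluster you might export SPARK_MASTER_URL=spark://master-url:7077 before running the script.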

Examples
Runs Spark locally using 1 CPU core.
Apache Spark
spark = SparkSession.builder.master("local[1]").appName("TestApp").getOrCreate()
Runs Spark locally using all available CPU cores.
Apache Spark
spark = SparkSession.builder.master("local[*]").appName("TestApp").getOrCreate()
Connects to a Spark cluster at the given IP address and port.
Apache Spark
spark = SparkSession.builder.master("spark://192.168.1.100:7077").appName("ClusterApp").getOrCreate()
Sample Program

This code runs Spark locally using all CPU cores. It creates a small DataFrame of fruits, displays it, and then prints the master URL to confirm which mode Spark is running in.

Apache Spark
from pyspark.sql import SparkSession

# Create Spark session in local mode using all cores
spark = SparkSession.builder.master("local[*]").appName("LocalVsClusterDemo").getOrCreate()

# Create a simple DataFrame
data = [(1, "apple"), (2, "banana"), (3, "cherry")]
columns = ["id", "fruit"]
df = spark.createDataFrame(data, columns)

# Show the DataFrame
print("DataFrame content:")
df.show()

# Print Spark master URL to confirm mode
print(f"Running Spark in mode: {spark.sparkContext.master}")

spark.stop()
Output
DataFrame content:
+---+------+
| id| fruit|
+---+------+
|  1| apple|
|  2|banana|
|  3|cherry|
+---+------+

Running Spark in mode: local[*]
Important Notes

Local mode is great for learning and small tasks but can't handle big data well.

Cluster mode requires setting up multiple machines but can process large data quickly.

Always check the master setting to know where your Spark job runs.
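One way to check the mode programmatically is to inspect the master URL exposed by spark.sparkContext.master. The helper below is a small sketch (not part of the Spark API) that classifies such a URL:

```python
def is_local_master(master_url):
    # "local", "local[4]", and "local[*]" all indicate local mode;
    # anything else (e.g. "spark://host:7077", "yarn") means a cluster.
    return master_url.startswith("local")

print(is_local_master("local[*]"))                    # True
print(is_local_master("spark://192.168.1.100:7077"))  # False
```

In a running job you would call it as is_local_master(spark.sparkContext.master).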

Summary

Local mode runs Spark on one computer, good for small data and testing.

Cluster mode runs Spark on many computers, needed for big data.

You choose the mode by setting the master parameter when creating the Spark session.