
Spark architecture (driver, executors, cluster manager) in Apache Spark

Introduction

Spark's architecture defines how a big data job is divided and run across many computers. It splits the job into tasks that run in parallel, so the job finishes faster than it would on a single machine.

This architecture is useful:
- When processing large data sets that don't fit on one computer.
- When you want to run data analysis or machine learning on a cluster of machines.
- When you need to manage resources and tasks across many computers automatically.
- When you want to speed up data processing by running tasks in parallel.
- When you need a job to recover from machine or task failures during processing.
Syntax
Apache Spark
Spark architecture has three main parts:

1. Driver: The main program that controls the job.
2. Executors: Workers that run tasks and store data.
3. Cluster Manager: The system that allocates resources and manages executors.

The Driver runs your main code and plans tasks.

Executors do the actual work on data in parallel.
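
The split of responsibilities above can be sketched as a toy model in plain Python. This is not Spark code and does not reflect Spark's internals: the names plan_tasks, run_task, and run_job are hypothetical, and a thread pool stands in for the executors. It only shows the shape of the flow, where a driver splits data into partitions, one task per partition runs in parallel, and the driver combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_tasks(data, num_partitions):
    """Driver role: split the data into at most num_partitions chunks, one task each."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_task(partition):
    """Executor role: process one partition independently of the others."""
    return sum(x * x for x in partition)


def run_job(data, num_partitions=4, num_workers=2):
    """Driver role: plan tasks, hand them to the workers, combine the results."""
    partitions = plan_tasks(data, num_partitions)
    # The thread pool is a stand-in for executors granted by a cluster manager.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(run_task, partitions))
    return sum(partial_results)


total = run_job(list(range(10)))
print(total)  # 285, the sum of squares of 0..9
```

With 4 partitions and 2 workers, at most 2 tasks run at a time and the rest wait for a free worker, which is roughly how tasks queue for executor cores in a real cluster.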

Examples
This shows the roles of each part simply.
Apache Spark
Driver: Runs your Spark application code.
Executors: Run tasks and keep data in memory.
Cluster Manager: Allocates resources like CPU and memory.
Different cluster managers help Spark run on various systems.
Apache Spark
Cluster Manager options:
- Standalone (Spark's own manager)
- YARN (used in Hadoop clusters)
- Mesos (general cluster manager; deprecated since Spark 3.2)
- Kubernetes (container orchestration)
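
Which cluster manager Spark uses is decided by the master URL passed to SparkSession.builder.master(...) or to spark-submit --master. As a rough illustration, the helper below classifies a master URL by its prefix. The function cluster_manager_for is hypothetical and not part of PySpark, and the host names and ports in the examples are placeholders.

```python
def cluster_manager_for(master_url):
    """Return the cluster manager implied by a Spark master URL (illustrative only)."""
    if master_url == "local" or master_url.startswith("local["):
        return "local mode (no cluster manager)"
    if master_url.startswith("spark://"):
        return "Standalone"
    if master_url == "yarn":
        return "YARN"
    if master_url.startswith("mesos://"):
        return "Mesos"
    if master_url.startswith("k8s://"):
        return "Kubernetes"
    raise ValueError(f"Unrecognized master URL: {master_url!r}")


examples = [
    "local[2]",                          # 2 worker threads in a single JVM
    "spark://master-host:7077",          # placeholder host for a standalone master
    "yarn",                              # cluster location comes from the Hadoop config
    "mesos://mesos-host:5050",           # placeholder host
    "k8s://https://k8s-apiserver:6443",  # placeholder API server address
]
for url in examples:
    print(f"{url:36} -> {cluster_manager_for(url)}")
```

The application code stays the same across these options; only the master URL and the related deployment settings change.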
Sample Program

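This code starts Spark in local mode. In this mode there is no separate cluster manager: the driver and the executor share one process, and local[2] gives it 2 worker threads for running tasks. The code creates and shows a small table.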

Apache Spark
from pyspark.sql import SparkSession

# Create a SparkSession; the process running this script acts as the Driver
spark = SparkSession.builder.master('local[2]').appName('SparkArchitectureDemo').getOrCreate()

# Create a simple DataFrame
data = [('Alice', 34), ('Bob', 45), ('Cathy', 29)]
df = spark.createDataFrame(data, ['Name', 'Age'])

# Show the DataFrame
print('DataFrame content:')
df.show()

# Stop SparkSession
spark.stop()
Important Notes

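The Driver schedules tasks and collects their results; the distributed processing happens on the Executors. Actions such as collect() pull data back to the Driver, so very large results can exhaust its memory.

Executors run tasks in parallel and can cache data in memory so that later steps reuse it without recomputing.

The Cluster Manager allocates resources and can launch replacement executors when one fails; the Driver then reschedules the lost tasks.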
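
How much work runs in parallel depends on how many executors the cluster manager grants and how many cores each one has. The sketch below uses real Spark property names (spark.executor.instances, spark.executor.cores, spark.task.cpus), but the values are made up for illustration and the helpers task_slots and waves are hypothetical. It assumes a fixed number of executors and tasks of equal duration.

```python
import math

# Example settings; the property names are Spark's, the values are illustrative.
conf = {
    "spark.executor.instances": 4,  # executors requested from the cluster manager
    "spark.executor.cores": 2,      # cores (concurrent task threads) per executor
    "spark.task.cpus": 1,           # cores reserved per task (Spark's default is 1)
}


def task_slots(conf):
    """Number of tasks that can run at the same time across all executors."""
    per_executor = conf["spark.executor.cores"] // conf["spark.task.cpus"]
    return conf["spark.executor.instances"] * per_executor


def waves(num_tasks, conf):
    """Rounds of tasks needed to finish a stage, assuming equal task durations."""
    return math.ceil(num_tasks / task_slots(conf))


print(task_slots(conf))  # 8
print(waves(20, conf))   # 3
```

A stage with 20 partitions on this configuration would run as two full rounds of 8 tasks and a final round of 4, so adding partitions beyond the available slots adds queueing rather than parallelism.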

Summary

The Driver controls the job and plans tasks.

The Executors run tasks and store data.

The Cluster Manager allocates resources and launches executors.