
Spark architecture (driver, executors, cluster manager) in Apache Spark


Introduction

Spark's architecture defines how a big data job is distributed across many computers. It splits the job into tasks that run in parallel, so the work finishes sooner than it would on a single machine.

It is useful in the following situations:

When processing large data sets that don't fit on one computer.

When you want to run data analysis or machine learning on a cluster of machines.

When you need to manage resources and tasks across many computers automatically.

When you want to speed up data processing by running tasks in parallel.

When you want to handle failures smoothly during big data processing.

Syntax

Spark architecture has three main parts:

1. Driver: The main program that controls the job.
2. Executors: Workers that run tasks and store data.
3. Cluster Manager: The system that allocates resources and manages executors.

The Driver runs your main code and plans tasks.

Executors do the actual work on data in parallel.
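The split between planning and doing can be sketched without a cluster. The following is a toy analogy in plain Python, not Spark's API: `plan_tasks` and `run_task` are hypothetical names, a thread pool stands in for the executors, and the calling code plays the driver.

```python
from concurrent.futures import ThreadPoolExecutor

def plan_tasks(data, num_partitions):
    """Driver role: split the data into roughly equal partitions, one task each."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def run_task(partition):
    """Executor role: do the actual work on one partition."""
    return sum(x * x for x in partition)

data = list(range(1, 11))
tasks = plan_tasks(data, num_partitions=3)

# The pool stands in for executors running tasks in parallel.
with ThreadPoolExecutor(max_workers=3) as executors:
    partial_results = list(executors.map(run_task, tasks))

# The driver combines the partial results into the final answer.
total = sum(partial_results)
print(tasks)
print(total)
```

In real Spark the executors are separate processes, usually on other machines, and the driver sends them serialized tasks over the network instead of sharing memory with them.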

Examples

The roles of each part, stated simply:

Driver: Runs your Spark application code.
Executors: Run tasks and keep data in memory.
Cluster Manager: Allocates resources like CPU and memory.
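An application states how much CPU and memory it wants, and the cluster manager grants or denies that request. A minimal sketch of building such a request as a `spark-submit` command line follows. The helper `spark_submit_command`, the script name `my_app.py`, and the resource values are assumptions for illustration; the `spark.executor.*` keys are standard Spark configuration properties.

```python
import shlex

def spark_submit_command(app, master, executors, cores, memory):
    """Assemble a spark-submit command line (hypothetical helper)."""
    args = [
        "spark-submit",
        "--master", master,
        "--conf", f"spark.executor.instances={executors}",  # how many executors
        "--conf", f"spark.executor.cores={cores}",          # CPU cores per executor
        "--conf", f"spark.executor.memory={memory}",        # heap memory per executor
        app,
    ]
    return shlex.join(args)

command = spark_submit_command("my_app.py", "yarn", executors=4, cores=2, memory="4g")
print(command)
```

The function only builds the command string; it does not launch anything, so it can be run without a Spark installation.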

Spark can run under several cluster managers, which lets it fit into different kinds of infrastructure.

Cluster Manager options:
- Standalone (Spark's own manager)
- YARN (used in Hadoop clusters)
- Mesos (general cluster manager; deprecated since Spark 3.2)
- Kubernetes (container orchestration)
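The cluster manager is chosen through the master URL passed to Spark. The sketch below lists the URL format for each option; the helper `master_url` and the host names and ports are placeholders invented for illustration, not values from a real cluster.

```python
# Master URL formats accepted by Spark; hosts and ports are placeholders.
MASTER_URL_FORMATS = {
    "local": "local[{threads}]",
    "standalone": "spark://{host}:{port}",
    "yarn": "yarn",
    "mesos": "mesos://{host}:{port}",
    "kubernetes": "k8s://https://{host}:{port}",
}

def master_url(manager, host="cluster-host", port=7077, threads=2):
    """Build a master URL for the given cluster manager (hypothetical helper)."""
    return MASTER_URL_FORMATS[manager].format(host=host, port=port, threads=threads)

print(master_url("standalone"))
print(master_url("kubernetes", host="k8s-api.example.com", port=6443))
print(master_url("yarn"))
```

The resulting string is what would be passed to `SparkSession.builder.master(...)` or to `spark-submit --master`. With YARN the cluster location comes from the Hadoop configuration, which is why its URL carries no host.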

Sample Program

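This code runs Spark in local mode: the Driver and the executor share a single process on your machine, and 'local[2]' gives it 2 worker threads for running tasks. No separate cluster manager is involved. The program creates a small table and displays it.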

from pyspark.sql import SparkSession

# Create the SparkSession; the process running this script is the Driver.
# 'local[2]' runs Spark in local mode with 2 worker threads.
spark = (
    SparkSession.builder
    .master('local[2]')
    .appName('SparkArchitectureDemo')
    .getOrCreate()
)

# Create a simple DataFrame
data = [('Alice', 34), ('Bob', 45), ('Cathy', 29)]
df = spark.createDataFrame(data, ['Name', 'Age'])

# Show the DataFrame
print('DataFrame content:')
df.show()

# Stop SparkSession
spark.stop()

Output

DataFrame content:
+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
|  Bob| 45|
|Cathy| 29|
+-----+---+

Important Notes

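The Driver schedules tasks and normally does not process the data partitions itself. Actions such as collect() do bring results back to the Driver, so large results can exhaust its memory.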

Executors run tasks in parallel and keep data in memory for speed.

The Cluster Manager handles resource allocation and can launch replacement executors when one fails. The Driver then reschedules the tasks that the failed executor was running.
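Recovering from a failed executor means re-running the tasks it was working on. The toy sketch below, in plain Python and not Spark's API, shows the idea of retrying a task up to a limit. The names `run_with_retries` and `flaky_sum` are hypothetical; the limit of 4 mirrors the default of Spark's `spark.task.maxFailures` setting.

```python
def run_with_retries(task, partition, max_failures=4):
    """Driver role: re-run a failed task until it succeeds or the limit is hit."""
    for attempt in range(1, max_failures + 1):
        try:
            return task(partition), attempt
        except RuntimeError:
            continue  # in Spark the retry may land on a different executor
    raise RuntimeError(f"task failed {max_failures} times; job aborted")

failures_left = {"count": 2}  # simulate an executor that is lost twice

def flaky_sum(partition):
    """Executor role: fails while simulated failures remain, then succeeds."""
    if failures_left["count"] > 0:
        failures_left["count"] -= 1
        raise RuntimeError("executor lost")
    return sum(partition)

result, attempts = run_with_retries(flaky_sum, [1, 2, 3])
print(result, attempts)
```

The task succeeds on the third attempt. If it had failed on every attempt, the sketch would raise an error, much as Spark aborts a job once a task exceeds its failure limit.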

Summary

The Driver controls the job and plans tasks.

Executors run tasks and store data.

The Cluster Manager allocates resources and launches executors.