Apache Spark · ~10 mins

Google Dataproc overview in Apache Spark - Step-by-Step Execution

Concept Flow - Google Dataproc overview
User submits job to Dataproc
Dataproc creates a cluster
Cluster runs Apache Spark job
Job processes data on cluster nodes
Results returned to user
User deletes cluster to save cost
This flow shows how a user submits a Spark job to Google Dataproc: Dataproc creates a cluster, runs the job, processes the data, and returns the results; the cluster is then deleted to avoid idle charges.
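The steps above can be sketched with the gcloud CLI. This is a minimal outline, not a complete recipe: `my-cluster`, the region, and `job.py` are placeholders, and the flags you need depend on your project setup.

```shell
# Steps 1-2: create a Dataproc cluster (Spark comes preinstalled)
gcloud dataproc clusters create my-cluster --region=us-central1

# Steps 3-5: submit a PySpark job; driver output streams back to the terminal
gcloud dataproc jobs submit pyspark job.py \
    --cluster=my-cluster --region=us-central1

# Step 6: delete the cluster so it stops accruing charges
gcloud dataproc clusters delete my-cluster --region=us-central1
```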
Execution Sample
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()  # entry point for Spark
data = spark.range(5)   # DataFrame with a single "id" column: 0 to 4
data.show()             # action: runs the job and prints the rows
spark.stop()            # stop the session once the job is done
This code creates a SparkSession on Dataproc, generates a dataset of the numbers 0 through 4, displays it, and stops the session.
Execution Table
Step | Action                  | Spark Session State | Data Created    | Output
1    | Create SparkSession     | Active              | None            | None
2    | Generate range data 0-4 | Active              | [0, 1, 2, 3, 4] | None
3    | Show data               | Active              | [0, 1, 2, 3, 4] | 0 1 2 3 4
4    | Job complete            | Active              | [0, 1, 2, 3, 4] | Displayed data
5    | Stop SparkSession       | Stopped             | [0, 1, 2, 3, 4] | None
💡 When the Spark job finishes, the session is stopped or the cluster is deleted to save cost.
Variable Tracker
Variable | Start | After Step 1          | After Step 2          | After Step 3          | Final
spark    | None  | SparkSession (active) | SparkSession (active) | SparkSession (active) | Stopped
data     | None  | None                  | [0, 1, 2, 3, 4]       | [0, 1, 2, 3, 4]       | [0, 1, 2, 3, 4]
Key Moments - 2 Insights
Why do we need to create a SparkSession before running any job?
The SparkSession is the entry point to Spark. Without it, you cannot create or run Spark jobs. See step 1 of the Execution Table, where the SparkSession is created before anything else.
What happens if we do not stop the SparkSession or delete the cluster?
The cluster and SparkSession keep running and continue to cost money. It's important to stop the session or delete the cluster after the job finishes, as shown in step 5 of the Execution Table.
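One way to guard against forgotten clusters is Dataproc's scheduled-deletion support at cluster creation time. The sketch below uses the `--max-idle` flag with a hypothetical cluster name; check your gcloud version's documentation for the exact options available.

```shell
# --max-idle asks Dataproc to delete the cluster automatically after
# 30 minutes with no running or submitted jobs, capping idle cost.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --max-idle=30m
```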
Visual Quiz - 3 Questions
Test your understanding
Looking at the Execution Table at step 3, what data is shown?
A. Numbers 0 to 4
B. Empty dataset
C. Numbers 1 to 5
D. Error message
💡 Hint
Check the 'Output' column at step 3 of the Execution Table.
At which step does the SparkSession become active?
A. Step 2
B. Step 1
C. Step 3
D. Step 5
💡 Hint
Look at the 'Spark Session State' column in the Execution Table.
If the cluster is not deleted after job completion, what is the likely impact?
A. No impact, it stops automatically
B. Job will rerun automatically
C. Costs continue to accumulate
D. Data will be lost
💡 Hint
Refer to the Key Moments section about stopping the SparkSession or deleting the cluster.
Concept Snapshot
Google Dataproc runs Apache Spark jobs on managed clusters.
User submits job -> Dataproc creates cluster -> Spark job runs -> Results returned.
Stop or delete cluster after job to save cost.
SparkSession is needed to run Spark code.
Simple data like ranges can be created and shown easily.
Full Transcript
Google Dataproc is a managed service for running Apache Spark jobs on clusters in the cloud. The user submits a Spark job; Dataproc creates a cluster, runs the job, processes the data, and returns the results. The SparkSession is the main interface for running Spark code. After the job finishes, stopping the SparkSession or deleting the cluster is important to avoid extra costs. In the example, a SparkSession is created, a simple dataset of the numbers 0 to 4 is generated, and the data is displayed. This shows the basic flow of running Spark jobs on Dataproc.