Apache Spark · ~10 mins

Google Dataproc overview in Apache Spark - Step-by-Step Execution

Concept Flow - Google Dataproc overview
User submits job to Dataproc
Dataproc creates a cluster
Cluster runs Apache Spark job
Job processes data on cluster nodes
Results returned to user
User deletes cluster to save cost
This flow shows how a user submits a Spark job to Google Dataproc: Dataproc creates a cluster, runs the job, processes the data, and returns the results; the cluster is then deleted to avoid idle charges.
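The steps above can be sketched with the gcloud CLI. This is a minimal outline, not a complete recipe: `my-cluster`, the region, and `job.py` are placeholders, and the flags you need depend on your project setup.

```shell
# Steps 1-2: create a Dataproc cluster (Spark comes preinstalled)
gcloud dataproc clusters create my-cluster --region=us-central1

# Steps 3-5: submit a PySpark job; driver output streams back to the terminal
gcloud dataproc jobs submit pyspark job.py \
    --cluster=my-cluster --region=us-central1

# Step 6: delete the cluster so it stops accruing charges
gcloud dataproc clusters delete my-cluster --region=us-central1
```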
Execution Sample
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()  # entry point for Spark
data = spark.range(5)   # DataFrame with a single "id" column: 0 to 4
data.show()             # action: runs the job and prints the rows
spark.stop()            # stop the session once the job is done
This code creates a SparkSession on Dataproc, generates a dataset of the numbers 0 through 4, displays it, and stops the session.
Execution Table
Step | Action                  | Spark Session State | Data Created    | Output
1    | Create SparkSession     | Active              | None            | None
2    | Generate range data 0-4 | Active              | [0, 1, 2, 3, 4] | None
3    | Show data               | Active              | [0, 1, 2, 3, 4] | 0 1 2 3 4
4    | Job complete            | Active              | [0, 1, 2, 3, 4] | Displayed data
5    | Stop SparkSession       | Stopped             | [0, 1, 2, 3, 4] | None
💡 When the Spark job finishes, the session is stopped or the cluster is deleted to save cost.
Variable Tracker
Variable | Start | After Step 1          | After Step 2          | After Step 3          | Final
spark    | None  | SparkSession (active) | SparkSession (active) | SparkSession (active) | Stopped
data     | None  | None                  | [0, 1, 2, 3, 4]       | [0, 1, 2, 3, 4]       | [0, 1, 2, 3, 4]
Key Moments - 2 Insights
Why do we need to create a SparkSession before running any job?
The SparkSession is the entry point to Spark. Without it, you cannot create or run Spark jobs. See step 1 of the Execution Table, where the SparkSession is created before anything else.
What happens if we do not stop the SparkSession or delete the cluster?
The cluster and SparkSession keep running and continue to cost money. It's important to stop the session or delete the cluster after the job finishes, as shown in step 5 of the Execution Table.
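One way to guard against forgotten clusters is Dataproc's scheduled-deletion support at cluster creation time. The sketch below uses the `--max-idle` flag with a hypothetical cluster name; check your gcloud version's documentation for the exact options available.

```shell
# --max-idle asks Dataproc to delete the cluster automatically after
# 30 minutes with no running or submitted jobs, capping idle cost.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --max-idle=30m
```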
Visual Quiz - 3 Questions
Test your understanding
Looking at the Execution Table at step 3, what data is shown?
A. Numbers 0 to 4
B. Empty dataset
C. Numbers 1 to 5
D. Error message
💡 Hint
Check the 'Output' column at step 3 of the Execution Table.
At which step does the SparkSession become active?
A. Step 2
B. Step 1
C. Step 3
D. Step 5
💡 Hint
Look at the 'Spark Session State' column in the Execution Table.
If the cluster is not deleted after job completion, what is the likely impact?
A. No impact, it stops automatically
B. Job will rerun automatically
C. Costs continue to accumulate
D. Data will be lost
💡 Hint
Refer to the Key Moments section about stopping the SparkSession or deleting the cluster.
Concept Snapshot
Google Dataproc runs Apache Spark jobs on managed clusters.
User submits job -> Dataproc creates cluster -> Spark job runs -> Results returned.
Stop or delete cluster after job to save cost.
SparkSession is needed to run Spark code.
Simple data like ranges can be created and shown easily.
Full Transcript
Google Dataproc is a managed service for running Apache Spark jobs on clusters in the cloud. The user submits a Spark job; Dataproc creates a cluster, runs the job, processes the data, and returns the results. The SparkSession is the main interface for running Spark code. After the job finishes, stopping the SparkSession or deleting the cluster is important to avoid extra costs. In the example, a SparkSession is created, a simple dataset of the numbers 0 to 4 is generated, and the data is displayed. This shows the basic flow of running Spark jobs on Dataproc.