GCP · Cloud · ~10 mins

Dataproc for Spark/Hadoop in GCP - Step-by-Step Execution

Process Flow - Dataproc for Spark/Hadoop
User submits job
Dataproc cluster receives job
Job scheduled on cluster nodes
Spark/Hadoop processes data
Results stored in output location
User retrieves results
This flow shows how a user submits a Spark or Hadoop job to a Dataproc cluster, which processes the data and stores the results.
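The six steps above can be sketched end to end with the gcloud CLI. The cluster name, region, and bucket below are illustrative placeholders, not values from this example:

```shell
# 1. User submits a job to an existing Dataproc cluster
#    (my-cluster, us-central1, and my-bucket are placeholders).
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=org.apache.spark.examples.SparkPi \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 1000

# 2-4. Dataproc receives the job, queues it, schedules it onto
#      cluster nodes, and Spark processes the work.

# 5-6. Driver output lands in the cluster's staging bucket; the
#      user can then retrieve it from Cloud Storage, e.g.:
gsutil ls gs://my-bucket/
```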
Execution Sample
GCP
gcloud dataproc jobs submit spark --cluster=my-cluster --class=org.apache.spark.examples.SparkPi --region=us-central1 --jars=gs://dataproc-examples-2.0/jars/spark-examples_2.12-3.3.1.jar -- 1000
This command submits a Spark job to a Dataproc cluster to calculate Pi using 1000 samples.
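After submitting, the job's lifecycle can be observed directly rather than inferred; JOB_ID below is a placeholder for the ID printed at submission time:

```shell
# List jobs on the cluster with their states (PENDING, RUNNING, DONE).
gcloud dataproc jobs list --cluster=my-cluster --region=us-central1

# Show one job's details; status.state gives the current phase.
gcloud dataproc jobs describe JOB_ID --region=us-central1

# Block until the job finishes, streaming driver output such as
# the "Pi is roughly 3.14..." line from SparkPi.
gcloud dataproc jobs wait JOB_ID --region=us-central1
```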
Process Table
Step | Action | Input/Condition | Result/Output
1 | Submit job | User runs gcloud command | Job sent to Dataproc cluster
2 | Cluster receives job | Job arrives at cluster | Job queued for execution
3 | Schedule job | Cluster resources available | Job assigned to nodes
4 | Run Spark job | SparkPi class runs with 1000 samples | Pi calculated approximately
5 | Store results | Job completes successfully | Output saved to storage
6 | Retrieve results | User checks output location | User gets Pi result
7 | Exit | Job finished | No more actions
💡 Job finishes after results are stored and retrieved
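Steps 5 and 6 (store and retrieve results) could look like the following sketch, assuming the driver output went to the default staging bucket; JOB_ID and the output URI are placeholders:

```shell
# Ask Dataproc where it stored the driver output for this job.
gcloud dataproc jobs describe JOB_ID --region=us-central1 \
  --format='value(driverOutputResourceUri)'

# Download the first output chunk; for SparkPi it contains a line
# like "Pi is roughly 3.14...".
gsutil cat <driver-output-uri>.000000000
```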
Status Tracker
Variable | Start | After Step 2 | After Step 4 | Final
Job Status | Not submitted | Queued | Running | Completed
Pi Value | N/A | N/A | ~3.14 | ~3.14
Output Location | Empty | Empty | Empty | Contains result file
Key Moments - 3 Insights
Why does the job status change from 'Queued' to 'Running'?
Because the cluster assigns the job to nodes when resources become available, as shown in Process Table steps 3 and 4.
Where are the results stored after the job completes?
Results are stored in the output location, typically Cloud Storage, as shown in Process Table step 5.
What happens if the cluster has no available resources?
The job stays in the queue until resources free up, delaying step 3 scheduling.
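One way to reduce that queueing delay is cluster autoscaling, so Dataproc adds workers while jobs wait on resources. A minimal sketch, with an illustrative policy name and bounds that are not part of the original example:

```shell
# autoscaling-policy.yaml might contain, for example:
#   workerConfig:
#     minInstances: 2
#     maxInstances: 10
#   basicAlgorithm:
#     yarnConfig:
#       scaleUpFactor: 0.5
#       scaleDownFactor: 1.0
#       gracefulDecommissionTimeout: 1h

# Import the policy, then attach it to the cluster.
gcloud dataproc autoscaling-policies import my-policy \
  --region=us-central1 --source=autoscaling-policy.yaml
gcloud dataproc clusters update my-cluster \
  --region=us-central1 --autoscaling-policy=my-policy
```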
Visual Quiz - 3 Questions
Test your understanding
Looking at the Process Table, what is the job status after step 4?
A. Completed
B. Queued
C. Running
D. Not submitted
💡 Hint
Check the 'Job Status' row in the Status Tracker after step 4
At which step are the results stored in the output location?
A. Step 5
B. Step 4
C. Step 3
D. Step 6
💡 Hint
Look at the Process Table row describing 'Store results'
If the user changes the sample size from 1000 to 10000, how would the Process Table change?
A. Step 2 would be skipped
B. Step 4 would take longer to run
C. Step 5 would not store results
D. Job would not be submitted
💡 Hint
Increasing the sample count affects processing time in step 4; see the 'Run Spark job' action
Concept Snapshot
Dataproc runs Spark/Hadoop jobs on managed clusters.
User submits job via gcloud or API.
Cluster schedules and runs job on nodes.
Results saved to cloud storage.
Simple, scalable big data processing in the cloud.
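The full lifecycle in the snapshot, from managed cluster to cleanup, might be sketched as follows; names and sizes are placeholders, and deleting the cluster when done avoids paying for idle nodes:

```shell
# Create a small managed cluster.
gcloud dataproc clusters create my-cluster \
  --region=us-central1 --num-workers=2

# Submit the Spark job as shown in the execution sample, then
# tear the cluster down once results are retrieved.
gcloud dataproc clusters delete my-cluster --region=us-central1 --quiet
```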
Full Transcript
Dataproc lets users run Spark or Hadoop jobs easily on Google Cloud. The user submits a job, which the Dataproc cluster receives and queues. When resources are free, the cluster schedules the job on nodes. The Spark or Hadoop job processes data, for example calculating Pi with SparkPi. After processing, results are saved to cloud storage. The user can then retrieve the results. The job status changes from not submitted, to queued, to running, and finally completed. If resources are busy, the job waits in queue. Changing job parameters like sample size affects processing time. This flow simplifies big data processing by managing infrastructure automatically.