
Dataproc for Spark/Hadoop in GCP - Commands & Configuration

Introduction
Running big data jobs like Spark or Hadoop can be complex and slow if you manage servers yourself. Dataproc is a Google Cloud service that quickly creates and manages clusters to run these jobs easily and efficiently.
  • When you want to process large datasets using Spark or Hadoop without setting up servers manually
  • When you need to run a data analysis job quickly and then shut down the resources to save cost
  • When you want to scale your data processing up or down automatically based on workload
  • When you want to integrate your big data jobs with other Google Cloud services like Cloud Storage
  • When you want to avoid managing complex infrastructure and focus on your data processing code
Config File - cluster-config.yaml
cluster-config.yaml
gceClusterConfig:
  zone: us-central1-a
  networkUri: default
masterConfig:
  numInstances: 1
  machineTypeUri: n1-standard-2
workerConfig:
  numInstances: 2
  machineTypeUri: n1-standard-2
softwareConfig:
  imageVersion: 2.0-debian10
  optionalComponents:
    - ANACONDA
    - JUPYTER
  properties:
    spark:spark.executor.memory: 4g
    spark:spark.driver.memory: 4g
    yarn:yarn.nodemanager.vmem-check-enabled: false

This YAML file defines a Dataproc cluster configuration.

  • gceClusterConfig: Sets the zone and network for the cluster.
  • masterConfig: Defines one master node with a standard machine type.
  • workerConfig: Defines two worker nodes with the same machine type.
  • softwareConfig: Specifies the Dataproc image version and optional components like Anaconda and Jupyter for data science tools.
  • properties: Sets Spark executor and driver memory to 4GB and disables YARN virtual memory check for stability.
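If you prefer not to maintain a separate YAML file, roughly the same cluster can be described inline with flags on `gcloud dataproc clusters create`. This is a sketch: the flag names are standard gcloud options, but check your gcloud version for exact availability.

```shell
# Sketch: flag-based equivalent of cluster-config.yaml
# (assumes the gcloud CLI is installed and a default project is configured)
gcloud dataproc clusters create example-cluster \
  --region=us-central1 \
  --zone=us-central1-a \
  --num-masters=1 \
  --master-machine-type=n1-standard-2 \
  --num-workers=2 \
  --worker-machine-type=n1-standard-2 \
  --image-version=2.0-debian10 \
  --optional-components=ANACONDA,JUPYTER \
  --properties='spark:spark.executor.memory=4g,spark:spark.driver.memory=4g,yarn:yarn.nodemanager.vmem-check-enabled=false'
```

Note that `--properties` uses the same `prefix:key=value` convention as the YAML file's properties block, with entries separated by commas.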
Commands
This command creates a Dataproc cluster named 'example-cluster' in the us-central1 region from the configuration defined in the YAML file. Note that gcloud builds a cluster from a YAML file with the import subcommand, which sets up the nodes and software automatically.
Terminal
gcloud dataproc clusters import example-cluster --region=us-central1 --source=cluster-config.yaml
Expected Output
Waiting for cluster creation operation...done. Created [https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/clusters/example-cluster]. Cluster creation complete.
--region - Specifies the region where the cluster will be created
--source - Points to the YAML file containing the cluster configuration
This command lists all Dataproc clusters in the us-central1 region so you can verify that your cluster is running.
Terminal
gcloud dataproc clusters list --region=us-central1
Expected Output
NAME             REGION       STATUS
example-cluster  us-central1  RUNNING
--region - Filters clusters by region
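Because jobs fail if submitted before the cluster is ready, you can block until the cluster reports RUNNING. The sketch below polls with `clusters describe`; the `--format='value(...)'` projection is a standard gcloud feature, but the loop itself is illustrative.

```shell
# Sketch: poll the cluster state every 10 seconds until it is RUNNING
# (assumes gcloud is installed and the cluster was created as above)
while true; do
  state=$(gcloud dataproc clusters describe example-cluster \
    --region=us-central1 --format='value(status.state)')
  echo "Cluster state: ${state}"
  [ "${state}" = "RUNNING" ] && break
  sleep 10
done
```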
This command submits a Spark job to the cluster to calculate Pi using 1000 tasks. It runs the example SparkPi program included with Spark.
Terminal
gcloud dataproc jobs submit spark --cluster=example-cluster --region=us-central1 --class=org.apache.spark.examples.SparkPi --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
Expected Output
Job [job_1234567890] submitted. Waiting for job output... Job finished successfully. Pi is roughly 3.14
--cluster - Specifies which cluster to run the job on
--class - Specifies the main class of the Spark job
--jars - Specifies the jar file containing the Spark job
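For intuition about what the SparkPi example computes: it estimates Pi by sampling random points in the unit square and counting how many land inside the quarter circle. A tiny single-machine sketch of the same idea in plain awk (no Spark involved; SparkPi spreads these samples across the cluster's executors, and the estimate varies slightly between runs):

```shell
# Monte Carlo estimate of Pi, mirroring what SparkPi distributes
# across the cluster (illustrative only, not Spark)
awk 'BEGIN {
  srand(42)                    # fixed seed for repeatability
  n = 100000; inside = 0
  for (i = 0; i < n; i++) {
    x = rand(); y = rand()
    if (x * x + y * y <= 1.0)  # point falls inside the quarter circle
      inside++
  }
  printf "Pi is roughly %.2f\n", 4 * inside / n
}'
```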
This command deletes the Dataproc cluster to stop billing and free resources. The --quiet flag skips confirmation prompts.
Terminal
gcloud dataproc clusters delete example-cluster --region=us-central1 --quiet
Expected Output
Deleting cluster [example-cluster]...done.
--quiet - Skips confirmation prompt to delete the cluster
Key Concept

If you remember nothing else from this pattern, remember: Dataproc lets you quickly create and manage clusters to run big data jobs without handling servers yourself.

Common Mistakes
Not specifying the region when creating or managing clusters
Dataproc commands require the region to know where to create or find clusters; missing it causes errors or wrong defaults.
Always include the --region flag with the correct region for your cluster.
Submitting jobs before the cluster is fully running
Jobs will fail if the cluster is not ready, causing wasted time and confusion.
Check cluster status with 'gcloud dataproc clusters list' and wait until it shows RUNNING before submitting jobs.
Forgetting to delete clusters after use
Clusters keep running and incur costs even if not used.
Delete clusters with 'gcloud dataproc clusters delete' when you finish to save money.
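As extra insurance against forgotten clusters, Dataproc supports scheduled deletion at creation time. The sketch below uses the `--max-idle` flag, which deletes the cluster after a period with no running jobs; the 1-hour value is illustrative.

```shell
# Sketch: create a cluster that deletes itself after 1 hour of idleness
# (scheduled deletion; the idle duration is an illustrative choice)
gcloud dataproc clusters create example-cluster \
  --region=us-central1 \
  --max-idle=1h
```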
Summary
Create a Dataproc cluster using a YAML config file to define nodes and software.
List clusters to check their status and confirm they are running.
Submit Spark jobs to the cluster to process data or run examples.
Delete clusters when done to avoid unnecessary costs.
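The four summary steps can be strung together as one script. This is a sketch assuming the gcloud CLI is installed and configured with a project; the cluster name, region, and config file match the examples in this guide.

```shell
#!/bin/sh
# End-to-end sketch: create, verify, run, and tear down a Dataproc cluster.
set -e  # stop on the first failing command

REGION=us-central1
CLUSTER=example-cluster

# 1. Create the cluster from the YAML config
gcloud dataproc clusters import "$CLUSTER" \
  --region="$REGION" --source=cluster-config.yaml

# 2. Confirm it is running
gcloud dataproc clusters list --region="$REGION"

# 3. Submit the SparkPi example job
gcloud dataproc jobs submit spark \
  --cluster="$CLUSTER" --region="$REGION" \
  --class=org.apache.spark.examples.SparkPi \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000

# 4. Delete the cluster to stop billing
gcloud dataproc clusters delete "$CLUSTER" --region="$REGION" --quiet
```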