
Dataproc for Spark/Hadoop in GCP - Commands & Configuration

Introduction
Running big data jobs like Spark or Hadoop can be complex and slow if you manage servers yourself. Dataproc is a Google Cloud service that quickly creates and manages clusters to run these jobs easily and efficiently.
  • When you want to process large datasets using Spark or Hadoop without setting up servers manually
  • When you need to run a data analysis job quickly and then shut down the resources to save cost
  • When you want to scale your data processing up or down automatically based on workload
  • When you want to integrate your big data jobs with other Google Cloud services like Cloud Storage
  • When you want to avoid managing complex infrastructure and focus on your data processing code
Config File - cluster-config.yaml
cluster-config.yaml
gceClusterConfig:
  zone: us-central1-a
  networkUri: default
masterConfig:
  numInstances: 1
  machineTypeUri: n1-standard-2
workerConfig:
  numInstances: 2
  machineTypeUri: n1-standard-2
softwareConfig:
  imageVersion: 2.0-debian10
  optionalComponents:
    - ANACONDA
    - JUPYTER
  properties:
    spark:spark.executor.memory: 4g
    spark:spark.driver.memory: 4g
    yarn:yarn.nodemanager.vmem-check-enabled: false

This YAML file defines a Dataproc cluster configuration.

  • gceClusterConfig: Sets the zone and network for the cluster.
  • masterConfig: Defines one master node with a standard machine type.
  • workerConfig: Defines two worker nodes with the same machine type.
  • softwareConfig: Specifies the Dataproc image version and optional components like Anaconda and Jupyter for data science tools.
  • properties: Sets Spark executor and driver memory to 4GB and disables YARN virtual memory check for stability.
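If you prefer not to maintain a separate YAML file, roughly the same cluster can be described inline with flags on `gcloud dataproc clusters create`. This is a sketch: the flag names are standard gcloud options, but check your gcloud version for exact availability.

```shell
# Sketch: flag-based equivalent of cluster-config.yaml
# (assumes the gcloud CLI is installed and a default project is configured)
gcloud dataproc clusters create example-cluster \
  --region=us-central1 \
  --zone=us-central1-a \
  --num-masters=1 \
  --master-machine-type=n1-standard-2 \
  --num-workers=2 \
  --worker-machine-type=n1-standard-2 \
  --image-version=2.0-debian10 \
  --optional-components=ANACONDA,JUPYTER \
  --properties='spark:spark.executor.memory=4g,spark:spark.driver.memory=4g,yarn:yarn.nodemanager.vmem-check-enabled=false'
```

Note that `--properties` uses the same `prefix:key=value` convention as the YAML file's properties block, with entries separated by commas.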
Commands
This command creates a Dataproc cluster named 'example-cluster' in the us-central1 region from the configuration defined in the YAML file. Note that gcloud builds a cluster from a YAML file with the import subcommand, which sets up the nodes and software automatically.
Terminal
gcloud dataproc clusters import example-cluster --region=us-central1 --source=cluster-config.yaml
Expected Output
Waiting for cluster creation operation...done. Created [https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/clusters/example-cluster]. Cluster creation complete.
--region - Specifies the region where the cluster will be created
--source - Points to the YAML file containing the cluster configuration
This command lists all Dataproc clusters in the us-central1 region so you can verify that your cluster is running.
Terminal
gcloud dataproc clusters list --region=us-central1
Expected Output
NAME             REGION       STATUS
example-cluster  us-central1  RUNNING
--region - Filters clusters by region
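Because jobs fail if submitted before the cluster is ready, you can block until the cluster reports RUNNING. The sketch below polls with `clusters describe`; the `--format='value(...)'` projection is a standard gcloud feature, but the loop itself is illustrative.

```shell
# Sketch: poll the cluster state every 10 seconds until it is RUNNING
# (assumes gcloud is installed and the cluster was created as above)
while true; do
  state=$(gcloud dataproc clusters describe example-cluster \
    --region=us-central1 --format='value(status.state)')
  echo "Cluster state: ${state}"
  [ "${state}" = "RUNNING" ] && break
  sleep 10
done
```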
This command submits a Spark job to the cluster to calculate Pi using 1000 tasks. It runs the example SparkPi program included with Spark.
Terminal
gcloud dataproc jobs submit spark --cluster=example-cluster --region=us-central1 --class=org.apache.spark.examples.SparkPi --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
Expected Output
Job [job_1234567890] submitted. Waiting for job output... Job finished successfully. Pi is roughly 3.14
--cluster - Specifies which cluster to run the job on
--class - Specifies the main class of the Spark job
--jars - Specifies the jar file containing the Spark job
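For intuition about what the SparkPi example computes: it estimates Pi by sampling random points in the unit square and counting how many land inside the quarter circle. A tiny single-machine sketch of the same idea in plain awk (no Spark involved; SparkPi spreads these samples across the cluster's executors, and the estimate varies slightly between runs):

```shell
# Monte Carlo estimate of Pi, mirroring what SparkPi distributes
# across the cluster (illustrative only, not Spark)
awk 'BEGIN {
  srand(42)                    # fixed seed for repeatability
  n = 100000; inside = 0
  for (i = 0; i < n; i++) {
    x = rand(); y = rand()
    if (x * x + y * y <= 1.0)  # point falls inside the quarter circle
      inside++
  }
  printf "Pi is roughly %.2f\n", 4 * inside / n
}'
```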
This command deletes the Dataproc cluster to stop billing and free resources. The --quiet flag skips confirmation prompts.
Terminal
gcloud dataproc clusters delete example-cluster --region=us-central1 --quiet
Expected Output
Deleting cluster [example-cluster]...done.
--quiet - Skips confirmation prompt to delete the cluster
Key Concept

If you remember nothing else from this pattern, remember: Dataproc lets you quickly create and manage clusters to run big data jobs without handling servers yourself.

Common Mistakes
Not specifying the region when creating or managing clusters
Dataproc commands require the region to know where to create or find clusters; missing it causes errors or wrong defaults.
Always include the --region flag with the correct region for your cluster.
Submitting jobs before the cluster is fully running
Jobs will fail if the cluster is not ready, causing wasted time and confusion.
Check cluster status with 'gcloud dataproc clusters list' and wait until it shows RUNNING before submitting jobs.
Forgetting to delete clusters after use
Clusters keep running and incur costs even if not used.
Delete clusters with 'gcloud dataproc clusters delete' when you finish to save money.
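As extra insurance against forgotten clusters, Dataproc supports scheduled deletion at creation time. The sketch below uses the `--max-idle` flag, which deletes the cluster after a period with no running jobs; the 1-hour value is illustrative.

```shell
# Sketch: create a cluster that deletes itself after 1 hour of idleness
# (scheduled deletion; the idle duration is an illustrative choice)
gcloud dataproc clusters create example-cluster \
  --region=us-central1 \
  --max-idle=1h
```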
Summary
Create a Dataproc cluster using a YAML config file to define nodes and software.
List clusters to check their status and confirm they are running.
Submit Spark jobs to the cluster to process data or run examples.
Delete clusters when done to avoid unnecessary costs.
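The four summary steps can be strung together as one script. This is a sketch assuming the gcloud CLI is installed and configured with a project; the cluster name, region, and config file match the examples in this guide.

```shell
#!/bin/sh
# End-to-end sketch: create, verify, run, and tear down a Dataproc cluster.
set -e  # stop on the first failing command

REGION=us-central1
CLUSTER=example-cluster

# 1. Create the cluster from the YAML config
gcloud dataproc clusters import "$CLUSTER" \
  --region="$REGION" --source=cluster-config.yaml

# 2. Confirm it is running
gcloud dataproc clusters list --region="$REGION"

# 3. Submit the SparkPi example job
gcloud dataproc jobs submit spark \
  --cluster="$CLUSTER" --region="$REGION" \
  --class=org.apache.spark.examples.SparkPi \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000

# 4. Delete the cluster to stop billing
gcloud dataproc clusters delete "$CLUSTER" --region="$REGION" --quiet
```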