GCP · Cloud · ~30 mins

Dataproc for Spark/Hadoop in GCP - Mini Project: Build & Apply

Dataproc Cluster Setup for Spark Job
📖 Scenario: You are working as a cloud engineer for a company that wants to run big data processing jobs using Apache Spark on Google Cloud Platform. Your task is to create a Dataproc cluster, configure it, and submit a Spark job.
🎯 Goal: Build a Dataproc cluster configuration and prepare it to run a Spark job on Google Cloud Platform.
📋 What You'll Learn
Create a Dataproc cluster configuration dictionary with exact keys and values
Add a configuration variable for the cluster region
Write the code to submit a Spark job using the cluster configuration
Complete the cluster creation command with all required parameters
💡 Why This Matters
🌍 Real World
Dataproc clusters are used to run big data processing jobs on Google Cloud Platform, enabling scalable and managed Spark and Hadoop workloads.
💼 Career
Cloud engineers and data engineers often create and manage Dataproc clusters to run data analytics and processing pipelines efficiently.
1
Create Dataproc cluster configuration dictionary
Create a dictionary called cluster_config with these exact entries: 'project_id': 'my-gcp-project', 'cluster_name': 'spark-cluster', 'num_workers': 2, and 'master_machine_type': 'n1-standard-4'.
Need a hint?

Use curly braces to create a dictionary with the exact keys and values.
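One possible solution for this step, using a dictionary literal with the exact keys and values named above:

```python
# Dataproc cluster settings for this exercise (values from the step description).
cluster_config = {
    'project_id': 'my-gcp-project',
    'cluster_name': 'spark-cluster',
    'num_workers': 2,
    'master_machine_type': 'n1-standard-4',
}
```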

2
Add cluster region configuration
Add a variable called region and set it to the string 'us-central1'.
Need a hint?

Assign the string 'us-central1' to the variable named region.
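A one-line solution for this step:

```python
# Region where the Dataproc cluster will run.
region = 'us-central1'
```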

3
Write Spark job submission code
Create a dictionary called job_config with keys 'cluster_name' and 'main_jar_file_uri'. Set 'cluster_name' to cluster_config['cluster_name'] and 'main_jar_file_uri' to 'gs://my-bucket/spark-job.jar'.
Need a hint?

Use the cluster name from cluster_config and set the jar file URI as shown.
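One way to complete this step, shown with the cluster_config dictionary from step 1 included so the snippet runs on its own:

```python
# Cluster settings from step 1 (repeated here so the snippet is self-contained).
cluster_config = {
    'project_id': 'my-gcp-project',
    'cluster_name': 'spark-cluster',
    'num_workers': 2,
    'master_machine_type': 'n1-standard-4',
}

# Spark job settings: reuse the cluster name and point at the job jar in GCS.
job_config = {
    'cluster_name': cluster_config['cluster_name'],
    'main_jar_file_uri': 'gs://my-bucket/spark-job.jar',
}
```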

4
Complete Dataproc cluster creation command
Write a command string called create_cluster_cmd that uses gcloud dataproc clusters create with cluster_config['cluster_name'], --region set to region, --num-workers set to cluster_config['num_workers'], and --master-machine-type set to cluster_config['master_machine_type'].
Need a hint?

Use an f-string to build the command with the exact parameters from cluster_config and region.
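One possible solution, built with an f-string and the variables from the earlier steps (repeated here so the snippet runs on its own):

```python
# Settings from steps 1 and 2 (repeated here so the snippet is self-contained).
cluster_config = {
    'project_id': 'my-gcp-project',
    'cluster_name': 'spark-cluster',
    'num_workers': 2,
    'master_machine_type': 'n1-standard-4',
}
region = 'us-central1'

# Build the gcloud command string with the exact parameters from the step.
create_cluster_cmd = (
    f"gcloud dataproc clusters create {cluster_config['cluster_name']} "
    f"--region {region} "
    f"--num-workers {cluster_config['num_workers']} "
    f"--master-machine-type {cluster_config['master_machine_type']}"
)
```

Note the trailing space inside each f-string segment except the last: adjacent string literals are concatenated, so the spaces keep the flags separated in the final command.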