GCP · Cloud · ~30 mins

Dataproc for Spark/Hadoop in GCP - Mini Project: Build & Apply

Dataproc Cluster Setup for Spark Job
📖 Scenario: You are working as a cloud engineer for a company that wants to run big data processing jobs using Apache Spark on Google Cloud Platform. Your task is to create a Dataproc cluster, configure it, and submit a Spark job.
🎯 Goal: Build a Dataproc cluster configuration and prepare it to run a Spark job on Google Cloud Platform.
📋 What You'll Learn
Create a Dataproc cluster configuration dictionary with exact keys and values
Add a configuration variable for the cluster region
Write the code to submit a Spark job using the cluster configuration
Complete the cluster creation command with all required parameters
💡 Why This Matters
🌍 Real World
Dataproc clusters are used to run big data processing jobs on Google Cloud Platform, enabling scalable and managed Spark and Hadoop workloads.
💼 Career
Cloud engineers and data engineers often create and manage Dataproc clusters to run data analytics and processing pipelines efficiently.
1
Create Dataproc cluster configuration dictionary
Create a dictionary called cluster_config with these exact entries: 'project_id': 'my-gcp-project', 'cluster_name': 'spark-cluster', 'num_workers': 2, and 'master_machine_type': 'n1-standard-4'.
Need a hint?

Use curly braces to create a dictionary with the exact keys and values.
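One possible solution for this step, using a dictionary literal with the exact keys and values named above:

```python
# Dataproc cluster settings for this exercise (values from the step description).
cluster_config = {
    'project_id': 'my-gcp-project',
    'cluster_name': 'spark-cluster',
    'num_workers': 2,
    'master_machine_type': 'n1-standard-4',
}
```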

2
Add cluster region configuration
Add a variable called region and set it to the string 'us-central1'.
Need a hint?

Assign the string 'us-central1' to the variable named region.
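A one-line solution for this step:

```python
# Region where the Dataproc cluster will run.
region = 'us-central1'
```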

3
Write Spark job submission code
Create a dictionary called job_config with keys 'cluster_name' and 'main_jar_file_uri'. Set 'cluster_name' to cluster_config['cluster_name'] and 'main_jar_file_uri' to 'gs://my-bucket/spark-job.jar'.
Need a hint?

Use the cluster name from cluster_config and set the jar file URI as shown.
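One way to complete this step, shown with the cluster_config dictionary from step 1 included so the snippet runs on its own:

```python
# Cluster settings from step 1 (repeated here so the snippet is self-contained).
cluster_config = {
    'project_id': 'my-gcp-project',
    'cluster_name': 'spark-cluster',
    'num_workers': 2,
    'master_machine_type': 'n1-standard-4',
}

# Spark job settings: reuse the cluster name and point at the job jar in GCS.
job_config = {
    'cluster_name': cluster_config['cluster_name'],
    'main_jar_file_uri': 'gs://my-bucket/spark-job.jar',
}
```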

4
Complete Dataproc cluster creation command
Write a command string called create_cluster_cmd that uses gcloud dataproc clusters create with cluster_config['cluster_name'], --region set to region, --num-workers set to cluster_config['num_workers'], and --master-machine-type set to cluster_config['master_machine_type'].
Need a hint?

Use an f-string to build the command with the exact parameters from cluster_config and region.
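One possible solution, built with an f-string and the variables from the earlier steps (repeated here so the snippet runs on its own):

```python
# Settings from steps 1 and 2 (repeated here so the snippet is self-contained).
cluster_config = {
    'project_id': 'my-gcp-project',
    'cluster_name': 'spark-cluster',
    'num_workers': 2,
    'master_machine_type': 'n1-standard-4',
}
region = 'us-central1'

# Build the gcloud command string with the exact parameters from the step.
create_cluster_cmd = (
    f"gcloud dataproc clusters create {cluster_config['cluster_name']} "
    f"--region {region} "
    f"--num-workers {cluster_config['num_workers']} "
    f"--master-machine-type {cluster_config['master_machine_type']}"
)
```

Note the trailing space inside each f-string segment except the last: adjacent string literals are concatenated, so the spaces keep the flags separated in the final command.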