Apache Spark · ~30 mins

Databricks platform overview in Apache Spark - Mini Project: Build & Apply

Databricks Platform Overview
📖 Scenario: You have just joined a data science team that uses Databricks to analyze big data. Your first task is to get familiar with the Databricks platform by creating a simple dataset, configuring a setting, applying a basic Spark operation, and displaying the result.
🎯 Goal: Build a simple Databricks notebook workflow that creates a dataset, sets a configuration, performs a Spark transformation, and shows the output.
📋 What You'll Learn
Create a Spark DataFrame with specific data
Set a Spark configuration variable
Use a Spark transformation to filter data
Display the filtered DataFrame
💡 Why This Matters
🌍 Real World
Databricks is widely used in companies to process and analyze large datasets quickly using Spark. This project helps you understand the basic workflow of creating data, configuring Spark, transforming data, and viewing results.
💼 Career
Data scientists and data engineers use Databricks daily to prepare data for analysis, build machine learning models, and generate reports. Knowing how to work with DataFrames and Spark configurations is essential for these roles.
1
Create a Spark DataFrame
Create a Spark DataFrame called df with these exact rows: (1, 'Apple'), (2, 'Banana'), (3, 'Cherry'). Use the columns 'id' and 'fruit'.
Apache Spark
Need a hint?

Use spark.createDataFrame with a list of tuples and specify the column names as a list.

2
Set a Spark Configuration
Set the Spark configuration spark.sql.shuffle.partitions to 2 using spark.conf.set.
Apache Spark
Need a hint?

Use spark.conf.set with the key and value as strings.

3
Filter the DataFrame
Create a new DataFrame called filtered_df by filtering df to keep only rows where the id is greater than 1.
Apache Spark
Need a hint?

Use the filter method on df with the condition df.id > 1.

4
Display the Filtered DataFrame
Use filtered_df.show() to display the filtered DataFrame.
Apache Spark
Need a hint?

Call show() on filtered_df to print the table.