Apache Spark · ~30 mins

Databricks platform overview in Apache Spark - Mini Project: Build & Apply

Databricks Platform Overview
📖 Scenario: You have just joined a data science team that uses Databricks to analyze big data. Your first task is to get familiar with the Databricks platform by creating a simple dataset, configuring a setting, applying a basic Spark operation, and displaying the result.
🎯 Goal: Build a simple Databricks notebook workflow that creates a dataset, sets a configuration, performs a Spark transformation, and shows the output.
📋 What You'll Learn
Create a Spark DataFrame with specific data
Set a Spark configuration variable
Use a Spark transformation to filter data
Display the filtered DataFrame
💡 Why This Matters
🌍 Real World
Databricks is widely used in companies to process and analyze large datasets quickly using Spark. This project helps you understand the basic workflow of creating data, configuring Spark, transforming data, and viewing results.
💼 Career
Data scientists and data engineers use Databricks daily to prepare data for analysis, build machine learning models, and generate reports. Knowing how to work with DataFrames and Spark configurations is essential for these roles.
1
Create a Spark DataFrame
Create a Spark DataFrame called df with these exact rows: (1, 'Apple'), (2, 'Banana'), (3, 'Cherry'). Use the columns 'id' and 'fruit'.
Apache Spark
Need a hint?

Use spark.createDataFrame with a list of tuples and specify the column names as a list.

2
Set a Spark Configuration
Set the Spark configuration spark.sql.shuffle.partitions to 2 using spark.conf.set.
Apache Spark
Need a hint?

Use spark.conf.set with the key and value as strings.

3
Filter the DataFrame
Create a new DataFrame called filtered_df by filtering df to keep only rows where the id is greater than 1.
Apache Spark
Need a hint?

Use the filter method on df with the condition df.id > 1.

4
Display the Filtered DataFrame
Use filtered_df.show() to display the filtered DataFrame.
Apache Spark
Need a hint?

Call show() on filtered_df to print the table.