Apache Spark · Data · ~30 mins

Caching and persistence in Apache Spark - Mini Project: Build & Apply

📖 Scenario: You work as a data analyst for a retail company. You have a large dataset of sales transactions. You want to speed up repeated analysis by storing the data in memory or on disk.
🎯 Goal: Learn how to cache and persist a Spark DataFrame to improve performance for repeated queries.
📋 What You'll Learn
Create a Spark DataFrame with sales data
Set a cache or persist configuration
Apply caching or persistence to the DataFrame
Show the cached or persisted DataFrame output
💡 Why This Matters
🌍 Real World
Caching and persistence help speed up repeated data analysis tasks by storing data in memory or on disk, reducing computation time.
💼 Career
Data engineers and data scientists use caching and persistence in Spark to optimize performance of big data pipelines and interactive queries.
1
Create the sales DataFrame
Create a Spark DataFrame called sales_df with these exact rows: ("2024-01-01", "Alice", 100), ("2024-01-02", "Bob", 150), and ("2024-01-03", "Charlie", 200). Use columns named date, customer, and amount.
Need a hint?

Use spark.createDataFrame with a list of tuples and a list of column names.

2
Set the cache configuration
Create a variable called cache_enabled and set it to True to indicate caching is enabled.
Need a hint?

Just create a variable named cache_enabled and assign True.
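This step is plain Python: a boolean flag that the next step's `if` statement will check before applying the cache.

```python
# Flag controlling whether caching is applied in the next step.
cache_enabled = True
```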

3
Apply caching to the DataFrame
Use an if statement to check if cache_enabled is True. If yes, call cache() on sales_df and assign it back to sales_df.
Need a hint?

Use if cache_enabled: and then sales_df = sales_df.cache().

4
Show the cached DataFrame
Use sales_df.show() to display the contents of the cached DataFrame.
Need a hint?

Use sales_df.show() to print the DataFrame.