Apache Spark · ~30 mins

Cross joins and when to avoid them in Apache Spark - Mini Project: Build & Apply

📖 Scenario: You work at a retail company and want to combine two datasets: one with product categories and another with store locations. Seeing every possible category-location pair will help you plan marketing campaigns.
🎯 Goal: Build a Spark program that creates two DataFrames, performs a cross join to get all category-location pairs, and then prints the result.
📋 What You'll Learn
Create a DataFrame called categories with exact entries: 'category' column with values 'Electronics', 'Clothing', 'Toys'
Create a DataFrame called stores with exact entries: 'store' column with values 'New York', 'Los Angeles'
Create a boolean variable called allow_cross_join and set it to True
Use the crossJoin() method on categories and stores only if allow_cross_join is True
Print the resulting DataFrame with all category-store pairs
💡 Why This Matters
🌍 Real World
Cross joins generate every possible combination of two datasets, which is useful in marketing, scheduling, and recommendation systems. Because the result has rows_left × rows_right rows, they should be avoided on large tables unless that blow-up is intended.
💼 Career
Data scientists and analysts often need to enumerate every pairing of two datasets to explore relationships or prepare data for modeling.
1
Create the categories and stores DataFrames
Create a Spark DataFrame called categories with a single column 'category' containing these exact values: 'Electronics', 'Clothing', 'Toys'. Also create a Spark DataFrame called stores with a single column 'store' containing these exact values: 'New York', 'Los Angeles'.
Apache Spark
Need a hint?

Use spark.createDataFrame with a list of tuples and specify the column names.

2
Add a configuration variable to allow cross join
Create a boolean variable called allow_cross_join and set it to True.
Apache Spark
Need a hint?

Just create a variable named allow_cross_join and assign True.

3
Perform the cross join if allowed
Use an if statement to check if allow_cross_join is True. Inside the if, create a new DataFrame called category_store_pairs by performing a cross join of categories and stores using the crossJoin() method.
Apache Spark
Need a hint?

Use if allow_cross_join: and inside it assign category_store_pairs = categories.crossJoin(stores).

4
Print the cross join result
Print the category_store_pairs DataFrame using the show() method to display all category and store pairs.
Apache Spark
Need a hint?

Use category_store_pairs.show() to print the DataFrame.