Cross joins and when to avoid them
📖 Scenario: You work at a retail company and need to combine two datasets: one with product categories and another with store locations. To plan marketing campaigns, you want to see every possible category-location pair.
🎯 Goal: Build a Spark program that creates two DataFrames, performs a cross join to get all category-location pairs, and then prints the result.
📋 What You'll Learn
Create a DataFrame called categories with exact entries: 'category' column with values 'Electronics', 'Clothing', 'Toys'
Create a DataFrame called stores with exact entries: 'store' column with values 'New York', 'Los Angeles'
Create a boolean variable called allow_cross_join and set it to True
Use the crossJoin() method on categories and stores only if allow_cross_join is True
Print the resulting DataFrame with all category-store pairs
💡 Why This Matters
🌍 Real World
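A cross join generates every possible combination of rows from two datasets, which is useful in marketing, scheduling, and recommendation systems. Because the output row count is the product of the two input row counts, it works best on small inputs like the ones used here.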
💼 Career
Data scientists and analysts often need every combination of two datasets to explore relationships or prepare data for modeling.