
Cross joins and when to avoid them in Apache Spark - Step-by-Step Execution

Concept Flow - Cross joins and when to avoid them
Start with two DataFrames
Apply cross join
Result: Cartesian product
Check size: rows = rows1 * rows2
Use result or avoid if too large
End
Cross join combines every row of one DataFrame with every row of another, creating a Cartesian product. This can grow very large quickly, so check size before using.
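The Cartesian-product behavior can be sketched in plain Python, with no Spark needed, using `itertools.product` on the same sample rows as the execution sample below:

```python
from itertools import product

# Plain-Python stand-in for a cross join: pair every left row with every right row
df1_rows = [(1, "A"), (2, "B")]
df2_rows = [(10, "X"), (20, "Y")]

cross = [left + right for left, right in product(df1_rows, df2_rows)]
print(len(cross))  # 2 * 2 = 4
print(cross[0])    # (1, 'A', 10, 'X')
```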
Execution Sample
Apache Spark
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; in the pyspark shell, `spark` already exists
spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "val"])
df2 = spark.createDataFrame([(10, "X"), (20, "Y")], ["num", "char"])
cross_df = df1.crossJoin(df2)  # Cartesian product: 2 * 2 = 4 rows
cross_df.show()
This code creates two small DataFrames and performs a cross join to combine every row from df1 with every row from df2.
Execution Table
| Step | Action             | df1 Rows | df2 Rows | Result Rows | Explanation                                                      |
|------|--------------------|----------|----------|-------------|------------------------------------------------------------------|
| 1    | Create df1         | 2        | -        | -           | df1 has 2 rows: (1, A), (2, B)                                   |
| 2    | Create df2         | -        | 2        | -           | df2 has 2 rows: (10, X), (20, Y)                                 |
| 3    | Perform cross join | 2        | 2        | 4           | Each row in df1 pairs with each row in df2, total 2*2=4 rows     |
| 4    | Show result        | -        | -        | -           | Result rows: (1,A,10,X), (1,A,20,Y), (2,B,10,X), (2,B,20,Y)      |
| 5    | Check size         | -        | -        | 4           | Result size is manageable, cross join OK here                    |
| 6    | End                | -        | -        | -           | Execution stops after showing cross join result                  |
💡 Execution stops after cross join result is shown and size checked.
Variable Tracker
| Variable | Start     | After Step 1 | After Step 2 | After Step 3 | Final  |
|----------|-----------|--------------|--------------|--------------|--------|
| df1      | undefined | 2 rows       | 2 rows       | 2 rows       | 2 rows |
| df2      | undefined | undefined    | 2 rows       | 2 rows       | 2 rows |
| cross_df | undefined | undefined    | undefined    | 4 rows       | 4 rows |
Key Moments - 2 Insights
Why does the number of rows in the cross join result equal the product of the input DataFrames' row counts?
A cross join pairs every row from the first DataFrame with every row from the second, so total rows = rows_df1 * rows_df2, as shown in step 3 of the execution table.
When should you avoid using cross joins in Spark?
Avoid cross joins when the input DataFrames are large: the result size grows multiplicatively, which can cause memory and performance issues, as the size check in step 5 of the execution table implies.
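One defensive pattern is to estimate the product of the input row counts before joining. The sketch below uses hypothetical names (`safe_to_cross_join` and `MAX_ROWS` are illustrative, not Spark APIs):

```python
MAX_ROWS = 1_000_000  # illustrative budget; tune for your cluster's memory

def safe_to_cross_join(left_count, right_count, max_rows=MAX_ROWS):
    # A cross join produces exactly left_count * right_count rows,
    # so refuse it when that product exceeds the budget.
    return left_count * right_count <= max_rows
```

In Spark you would obtain the counts with `df1.count()` and `df2.count()` (each triggers a job) and only call `df1.crossJoin(df2)` when the check passes.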
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, how many rows does the cross join result have after step 3?
A. 4
B. 2
C. 1
D. 0
💡 Hint
Check the 'Result Rows' column in the execution table's row for step 3.
At which step do we confirm that the cross join result size is manageable?
A. Step 4
B. Step 2
C. Step 5
D. Step 1
💡 Hint
Look at the 'Explanation' column of the execution table for step 5.
If df1 had 3 rows and df2 had 4 rows, how many rows would the cross join produce?
A. 7
B. 12
C. 1
D. 0
💡 Hint
Recall from step 3 of the execution table that result rows = rows_df1 * rows_df2.
Concept Snapshot
Cross join syntax: df1.crossJoin(df2)
Produces Cartesian product: every row of df1 with every row of df2
Result rows = rows_df1 * rows_df2
Avoid if inputs are large to prevent huge outputs
Check result size before using cross join
Full Transcript
Cross joins in Apache Spark combine every row from one DataFrame with every row from another, creating a Cartesian product. This means the output rows equal the product of the input DataFrames' row counts. For example, joining a DataFrame with 2 rows and another with 2 rows results in 4 rows. This can quickly become very large and slow down your program or cause memory errors. Therefore, always check the size of your inputs before using cross joins. If the inputs are large, try to avoid cross joins or use filters to reduce data before joining. The example code creates two small DataFrames and performs a cross join, showing the combined rows. The execution table traces each step, showing how the number of rows changes and when the process stops.
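The transcript's advice to filter before joining can be illustrated with a small plain-Python model (the row counts here are made up for the illustration):

```python
from itertools import product

left = list(range(100))   # stand-in for a 100-row DataFrame
right = list(range(100))  # stand-in for a 100-row DataFrame

unfiltered = len(list(product(left, right)))  # 100 * 100 = 10_000 pairs

# Filtering each side first shrinks the output multiplicatively
left_small = [x for x in left if x % 10 == 0]    # 10 rows
right_small = [y for y in right if y % 10 == 0]  # 10 rows
filtered = len(list(product(left_small, right_small)))  # 10 * 10 = 100 pairs

print(unfiltered, filtered)  # 10000 100
```

The same principle applies in Spark: apply `df.filter(...)` to each input before `crossJoin`, so the multiplication happens on the reduced row counts.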