Self joins in Apache Spark - Time & Space Complexity
A self join in Apache Spark combines a table with itself to find related rows.
We want to know how the time for this operation grows as the data gets bigger.
Analyze the time complexity of the following code snippet.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (1, "A"),
    (2, "B"),
    (3, "A"),
    (4, "C"),
], ["id", "category"])

joined_df = df.alias("df1").join(df.alias("df2"), on="category")
joined_df.show()
```
This code joins the DataFrame with itself on the "category" column to find pairs with the same category.
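As a quick sanity check, we can predict how many output rows that self join produces without running Spark at all. This is a plain-Python model of the pairing, not Spark itself: each category with k rows contributes k × k pairs.

```python
from collections import Counter

# The sample data from the snippet above.
rows = [(1, "A"), (2, "B"), (3, "A"), (4, "C")]

# A self join on "category" pairs every row with every row sharing
# its category, so a category with k rows contributes k * k pairs.
counts = Counter(category for _, category in rows)
output_rows = sum(k * k for k in counts.values())

print(output_rows)  # category A: 2*2, B: 1*1, C: 1*1 -> 6
```

Running `joined_df.count()` on the Spark DataFrame should agree with this figure.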
Identify the repeated operations, the equivalent of loops, recursion, or array traversals.
- Primary operation: matching each row in the first copy of the DataFrame with every row in the second copy that shares its category.
- How many times: for each of the n rows, the join emits one output row per matching row in the other copy; in the worst case, when every row shares one category, that is n matches per row.
As the number of rows grows, the number of row pairs can grow quadratically, because each row can match many rows in the other copy.
| Input Size (n) | Approx. Comparisons (worst case) |
|---|---|
| 10 | About 100 comparisons |
| 100 | About 10,000 comparisons |
| 1000 | About 1,000,000 comparisons |
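The figures in the table describe the worst case, in which every row falls into a single category. A plain-Python sketch of that case reproduces them by brute force:

```python
def worst_case_pairs(n):
    """Self-join output size when all n rows share one category."""
    rows = [(i, "A") for i in range(n)]
    # Every row matches every row (including itself): n * n pairs.
    return sum(1 for _, c1 in rows for _, c2 in rows if c1 == c2)

for n in (10, 100, 1000):
    print(n, worst_case_pairs(n))  # 100, 10000, 1000000
```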
Pattern observation: The work grows roughly with the square of the input size.
Time Complexity: O(n^2) in the worst case, when most rows share a single category.
Spark's sort-merge and hash join strategies avoid comparing every pair of rows directly, but the join output itself can contain up to n^2 rows, so doubling the data size can roughly quadruple the work.
[X] Wrong: "Self joins take linear time because it's just one table."
[OK] Correct: Reading the table once is O(n), but a self join materializes row pairs, so in the worst case the output, and therefore the work, grows quadratically.
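To see why "it's just one table" does not imply linear work, compare the two extreme category distributions. This is again a plain-Python model of pair counts, not Spark itself:

```python
from collections import Counter

def self_join_pairs(rows):
    # Each category with k rows contributes k * k output pairs.
    counts = Counter(category for _, category in rows)
    return sum(k * k for k in counts.values())

n = 1000
distinct = [(i, str(i)) for i in range(n)]  # every category unique
skewed = [(i, "A") for i in range(n)]       # all rows share one category

print(self_join_pairs(distinct))  # n pairs: linear in n
print(self_join_pairs(skewed))    # n * n pairs: quadratic in n
```

The same input size can cost O(n) or O(n^2) depending on key skew, which is why the worst case is the honest answer.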
Understanding how self joins scale helps you explain performance in real data tasks and shows you can think about costs clearly.
"What if we added a filter before the join to reduce rows? How would the time complexity change?"
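One way to explore that question with the same toy data (a plain-Python sketch; the filter condition here is made up for illustration): filtering first shrinks the input from n rows to some m <= n rows, so the worst case drops from n^2 to m^2. The complexity class is still quadratic, but in the smaller filtered size.

```python
from collections import Counter

rows = [(1, "A"), (2, "B"), (3, "A"), (4, "C")]

def self_join_pairs(rows):
    # Each category with k rows contributes k * k output pairs.
    counts = Counter(category for _, category in rows)
    return sum(k * k for k in counts.values())

# Hypothetical filter applied before the join.
filtered = [row for row in rows if row[1] == "A"]

print(self_join_pairs(rows))      # unfiltered: 6 pairs
print(self_join_pairs(filtered))  # filtered:   4 pairs
```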