
Self joins in Apache Spark - Time & Space Complexity

Time Complexity: Self joins
O(n^2)
Understanding Time Complexity

When we use self joins in Apache Spark, we combine a table with itself to find related rows.

We want to know how the time to do this grows as the data gets bigger.

Scenario Under Consideration

Analyze the time complexity of the following code snippet.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    (1, "A"),
    (2, "B"),
    (3, "A"),
    (4, "C")
], ["id", "category"])

# Inner join of the DataFrame with itself: pairs every row with each row sharing its category
joined_df = df.alias("df1").join(df.alias("df2"), on="category")
joined_df.show()

This code joins the DataFrame with itself on the "category" column to find pairs with the same category.
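To see how many rows the join above produces, here is a plain-Python sketch (no Spark runtime needed) that models the same inner self join: each category with k rows contributes k * k pairs.

```python
from collections import Counter

rows = [(1, "A"), (2, "B"), (3, "A"), (4, "C")]

# Count rows per category; a category with k rows yields k * k joined pairs.
counts = Counter(cat for _, cat in rows)
pairs = sum(k * k for k in counts.values())
print(pairs)  # A: 2*2 = 4, B: 1, C: 1 -> 6 rows in the joined result
```

Running `joined_df.count()` on the Spark DataFrame would return the same total, since an inner equi-join on "category" emits one output row per matching pair, including each row matched with itself.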

Identify Repeating Operations

Identify the loops, recursion, or traversals that repeat work.

  • Primary operation: Matching each row in the first DataFrame against the rows in the second DataFrame that share its category.
  • How many times: Each of the n rows is compared against every matching row in the other copy. In the worst case, where all rows share one category, that is n matches per row, or n * n comparisons in total.
How Execution Grows With Input

As the number of rows grows, the number of comparisons grows quickly because each row can match multiple rows in the other copy.

Input Size (n)    Approx. Operations
10                ~100 comparisons
100               ~10,000 comparisons
1000              ~1,000,000 comparisons

Pattern observation: The work grows roughly with the square of the input size.
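The table above can be reproduced with a small plain-Python model of the worst case, where every row falls in a single category (a deliberately pessimistic assumption, not the example data from earlier):

```python
def worst_case_comparisons(n):
    # Worst case: every row shares one category, so each of the
    # n rows matches all n rows in the other copy of the table.
    rows = [(i, "same") for i in range(n)]
    return sum(1 for _, a in rows for _, b in rows if a == b)

for n in (10, 100):
    print(n, worst_case_comparisons(n))  # 10 -> 100, 100 -> 10000
```

The count grows as n squared, matching the pattern in the table.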

Final Time Complexity

Time Complexity: O(n^2)

This means that doubling the data size roughly quadruples the join work, since (2n)^2 = 4n^2.
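A quick arithmetic check of the doubling claim, using the worst-case comparison count n * n as a stand-in for the join work:

```python
def join_work(n):
    # Worst-case comparison count for a self join on n rows.
    return n * n

# Doubling n quadruples the work: (2n)^2 / n^2 = 4.
print(join_work(200) // join_work(100))  # -> 4
```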

Common Mistake

[X] Wrong: "Self joins take linear time because it's just one table."

[OK] Correct: Actually, joining a table with itself compares many pairs of rows, so the work grows much faster than just reading the data once.

Interview Connect

Understanding how self joins scale helps you explain performance in real data tasks and shows you can think about costs clearly.

Self-Check

"What if we added a filter before the join to reduce rows? How would the time complexity change?"