0
0
Apache Sparkdata~5 mins

Caching and persistence in Apache Spark - Time & Space Complexity

Choose your learning style9 modes available
Time Complexity: Caching and persistence
O(n)
Understanding Time Complexity

When working with Apache Spark, caching and persistence help speed up repeated data use.

We want to see how caching changes the time it takes to run operations again.

Scenario Under Consideration

Analyze the time complexity of the following code snippet.

val data = spark.read.csv("data.csv")
val transformed = data.filter(_(2) === "USA").select("_c0", "_c1")
transformed.cache()
transformed.count()
transformed.collect()

This code loads data, filters rows, caches the result, then runs two actions that use the cached data.

Identify Repeating Operations

Identify the loops, recursion, array traversals that repeat.

  • Primary operation: Filtering and selecting rows from the dataset.
  • How many times: Without caching, these run every time an action is called; with caching, they run once.
How Execution Grows With Input

Without caching, each action repeats the full data scan, so time grows with data size every time.

Input Size (n)Approx. Operations Without CacheApprox. Operations With Cache
1020 (2 actions x 10)10 + 10 (cache once + reuse)
100200 (2 x 100)100 + 100
10002000 (2 x 1000)1000 + 1000

Pattern observation: Without cache, work doubles with each action; with cache, work is done once and reused.

Final Time Complexity

Time Complexity: O(n)

This means the time grows linearly with data size, but caching avoids repeating the same work multiple times.

Common Mistake

[X] Wrong: "Caching makes the operation run instantly regardless of data size."

[OK] Correct: Caching only saves time on repeated actions by storing results; the first run still processes all data.

Interview Connect

Understanding caching shows you can make data work faster by avoiding repeated heavy tasks, a key skill in data projects.

Self-Check

"What if we used persistence with a different storage level instead of cache? How would the time complexity change?"