Spark Architecture (Driver, Executors, Cluster Manager): Time & Space Complexity
We want to understand how the work in Spark grows as the data or tasks grow.
How does Spark's architecture affect the time it takes to run jobs?
Analyze the time complexity of this Spark job setup.
// Spark word-count example: the driver builds the plan, executors run the tasks
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("ExampleApp")
val sc = new SparkContext(conf)                   // starts the driver and connects to the cluster manager
val data = sc.textFile("hdfs://data/input.txt")   // lazily reads HDFS blocks as partitions
val words = data.flatMap(line => line.split(" ")) // executors split each line into words
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _) // shuffle combines counts per word
wordCounts.collect()                              // pulls all results back to the driver
sc.stop()
This code reads a text file, splits each line into words, counts occurrences of each word in parallel on the executors, and collects the results back to the driver.
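To see how the driver divides this work into executor tasks, you can inspect each RDD's partitioning. A minimal sketch, reusing the sc and the hypothetical input path from the example above:

// Each partition becomes one task per stage, scheduled by the driver.
val data = sc.textFile("hdfs://data/input.txt")
println(s"Input partitions: ${data.getNumPartitions}")

val counts = data.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
println(s"Partitions after the shuffle: ${counts.getNumPartitions}")
println(counts.toDebugString) // lineage: the stages the driver will schedule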
Look at what repeats as data grows.
- Primary operation: splitting each line and updating a count for each word.
- How many times: once per line and once per word in the input data.
More data means more lines and words to process.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 lines | ~10 line splits, plus one count per word |
| 100 lines | ~100 line splits, plus one count per word |
| 1,000 lines | ~1,000 line splits, plus one count per word |
Pattern observation: The work grows roughly in direct proportion to the amount of data.
Time Complexity: O(n)
This means the total work grows linearly with the size of the input data; with p executors the wall-clock time is roughly that work divided by p, plus coordination and shuffle overhead.
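To check the linear pattern empirically, you can time the same word count on synthetic inputs of increasing size. A rough sketch in local mode (the sizes are arbitrary, and single runs are noisy because of JVM warm-up):

import org.apache.spark.{SparkConf, SparkContext}

object ScalingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ScalingSketch").setMaster("local[4]"))
    for (n <- Seq(10000, 100000, 1000000)) {
      // n synthetic lines stand in for input files of different sizes.
      val lines = sc.parallelize(1 to n).map(i => s"spark scales line $i")
      val start = System.nanoTime()
      lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).count() // force execution
      val ms = (System.nanoTime() - start) / 1e6
      println(f"$n%8d lines -> $ms%9.1f ms") // expect roughly 10x time for 10x data
    }
    sc.stop()
  }
}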
[X] Wrong: "Adding more executors will always make the job run instantly."
[OK] Correct: More executors can shorten wall-clock time, but only up to a point, because driver coordination, the shuffle in reduceByKey, and collecting results to the driver still take time and can become the bottleneck.
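For example, the number of executors is just a configuration request, not a speed guarantee. A hedged sketch using standard Spark settings (whether spark.executor.instances is honored depends on the cluster manager; it is ignored in local mode):

import org.apache.spark.{SparkConf, SparkContext}

// Ask the cluster manager for more parallel workers.
val conf = new SparkConf()
  .setAppName("ExampleApp")
  .set("spark.executor.instances", "8")
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)
// Even with 8 executors, the driver still schedules every task, reduceByKey
// still shuffles data across the network, and collect() still funnels all
// results into the single driver JVM.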
Understanding how Spark handles data and tasks helps you explain how big data jobs scale and where delays can happen.
"What if we increased the number of partitions in the input data? How would the time complexity change?"