Apache Spark on Databricks - Time & Space Complexity
We want to understand how the time needed to run tasks on Databricks changes as the data size grows.
How does the platform handle bigger data and more operations?
Analyze the time complexity of the following Apache Spark code running on Databricks.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Example').getOrCreate()
# inferSchema triggers an extra pass over the file to detect column types.
df = spark.read.csv('/data/large_dataset.csv', header=True, inferSchema=True)
# Group by category, count rows per group, then sort groups by count descending.
result = df.groupBy('category').count().orderBy('count', ascending=False)
result.show()
```
This code reads a large CSV file, groups data by a column, counts entries per group, and sorts the results.
Look at what repeats as data size grows.
- Primary operation: Grouping and counting rows by category.
- How many times: Once per row in the dataset, so as many times as there are rows.
As the number of rows increases, the grouping and counting take more time roughly proportional to the number of rows.
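The per-row pattern can be sketched in plain Python (a simplified single-machine analogy, not Spark's distributed implementation): counting by key touches each row exactly once, so the work grows with the row count.

```python
from collections import Counter

# Toy rows standing in for the CSV records; 'category' is the grouping column.
rows = [
    {"category": "a"}, {"category": "b"}, {"category": "a"},
    {"category": "c"}, {"category": "a"}, {"category": "b"},
]

# One pass over the rows: each row is inspected exactly once,
# so the work grows linearly with len(rows).
counts = Counter(row["category"] for row in rows)

# The final sort runs over the distinct categories, not over all rows.
result = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(result)  # [('a', 3), ('b', 2), ('c', 1)]
```

Doubling `rows` doubles the number of dictionary lookups inside `Counter`, mirroring the linear growth described above.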
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 grouping and counting steps |
| 100 | About 100 grouping and counting steps |
| 1000 | About 1000 grouping and counting steps |
Pattern observation: The work grows roughly in direct proportion to the number of rows.
Time Complexity: O(n)
This means run time grows linearly with the number of rows. (The final orderBy sorts only the distinct categories, say k of them; that O(k log k) cost is negligible when k is much smaller than n, so the per-row scan dominates.)
[X] Wrong: "Grouping and counting will take the same time no matter how big the data is."
[OK] Correct: The operation must look at each row, so more rows mean more work and more time.
Understanding how data operations scale helps you explain your approach to handling big data efficiently.
"What if we added a filter before grouping? How would that change the time complexity?"