Creating DataFrames from files (CSV, JSON, Parquet) in Apache Spark - Performance & Efficiency
Loading data from files into DataFrames is a common first step in data work with Apache Spark.
We want to understand how the time to load data changes as the file size grows.
Analyze the time complexity of the following code snippet.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LoadData").getOrCreate()
# Load CSV file into DataFrame
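# Note: inferSchema=True makes Spark scan the file to determine column types, adding an extra pass over the data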
df = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)
# Show first few rows
df.show(5)
This code loads a CSV file into a Spark DataFrame and displays the first five rows.
Identify the operations that repeat: loops, recursion, or array traversals.
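The same read pattern applies to the other formats named in the title. A minimal sketch, reusing the spark session from above and assuming the placeholder file paths exist:
# JSON: by default each line is parsed as a separate JSON record
df_json = spark.read.json("/path/to/file.json")
# Parquet: a columnar format that stores its own schema, so no inference pass is needed
df_parquet = spark.read.parquet("/path/to/file.parquet")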
- Primary operation: Reading each row from the file and parsing it.
- How many times: Once for every row in the file (n times).
As the number of rows in the file grows, the time to read and parse the data grows in proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 row reads and parses |
| 100 | 100 row reads and parses |
| 1000 | 1000 row reads and parses |
Pattern observation: The work grows directly with the number of rows.
Time Complexity: O(n)
This means the time to load the file grows linearly with the number of rows in the file.
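One way to check this linear growth is to time a full read of files of increasing size. A rough sketch, assuming the placeholder CSV paths below exist; count() is an action, so it forces Spark to read and parse every row:
import time

for path in ["/path/to/small.csv", "/path/to/medium.csv", "/path/to/large.csv"]:
    start = time.perf_counter()
    # count() forces a full read, so the elapsed time reflects the cost of loading all rows
    rows = spark.read.csv(path, header=True, inferSchema=True).count()
    elapsed = time.perf_counter() - start
    print(f"{path}: {rows} rows in {elapsed:.2f}s")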
[X] Wrong: "Loading a file is instant no matter the size."
[OK] Correct: Each row must be read and parsed, so bigger files take more time.
Understanding how data loading scales helps you explain performance in real projects and shows you know how Spark handles big data.
"What if we load multiple files at once instead of one? How would the time complexity change?"