Loading from S3, Azure Blob, GCS in Snowflake - Time & Space Complexity
When loading data from cloud storage such as S3, Azure Blob Storage, or GCS into Snowflake, it's important to understand how load time grows as data size increases.
Specifically: how does the number of operations change as we load more files, or bigger files?
Analyze the time complexity of the following operation sequence.
```sql
COPY INTO my_table
FROM @my_stage
FILE_FORMAT = (TYPE = 'CSV')
PATTERN = '.*\\.csv'
ON_ERROR = 'CONTINUE';
```
This command loads all CSV files from a cloud storage stage into a Snowflake table.
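Before running the load, you can estimate the input size n by listing the files the pattern would match. A minimal sketch, reusing the `@my_stage` name from the example above (`RESULT_SCAN` reads the output of the most recent query, here the `LIST`):

```sql
-- Each row returned here corresponds to one file read during the load.
LIST @my_stage PATTERN = '.*\\.csv';

-- Count the matches to get n, the input size for the complexity analysis.
SELECT COUNT(*) AS files_to_load
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
```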
Identify the operations that repeat: the API calls, resource provisioning steps, and data transfers.
- Primary operation: Reading each file from cloud storage (S3, Azure Blob, or GCS) and loading it into Snowflake.
- How many times: Once per file in the stage matching the pattern.
As the number of files increases, Snowflake makes one read operation per file to load the data.
| Input Size (n files) | Approx. API Calls/Operations |
|---|---|
| 10 | About 10 file read operations |
| 100 | About 100 file read operations |
| 1000 | About 1000 file read operations |
Pattern observation: The number of operations grows directly with the number of files to load.
Time Complexity: O(n)
This means the time to load data grows linearly with the number of files being loaded.
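You can check this one-operation-per-file behavior after a load with the `COPY_HISTORY` table function: it returns one row per loaded file, so its row count should match n. A sketch (the table name is carried over from the example; the one-hour lookback window is an assumption):

```sql
-- One row per loaded file; the number of rows should equal n.
SELECT file_name, row_count, file_size, status
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'MY_TABLE',
    START_TIME => DATEADD('hour', -1, CURRENT_TIMESTAMP())
));
```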
[X] Wrong: "Loading many small files is just as fast as loading one big file of the same total size."
[OK] Correct: Each file requires a separate read request, so many small files incur more per-file overhead (request latency, load metadata tracking) and take longer than one big file of the same total size.
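One common mitigation is to consolidate many small files into fewer, larger ones before loading. A sketch of one way to do this inside Snowflake, by unloading a staging table back to the stage in larger chunks (the table name, path, and 100 MB target size are illustrative assumptions):

```sql
-- Re-export data as fewer, larger files (up to ~100 MB each) so a later
-- COPY INTO pays the per-file overhead far fewer times.
COPY INTO @my_stage/consolidated/
FROM my_staging_table
FILE_FORMAT = (TYPE = 'CSV')
MAX_FILE_SIZE = 104857600;  -- 100 MB cap per output file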
Understanding how loading scales with file count helps you design efficient data pipelines and shows you can think about real-world cloud data workflows.
"What if we compressed all files into one archive file before loading? How would the time complexity change?"