Loading from S3, Azure Blob, GCS in Snowflake - Time & Space Complexity
Start learning this pattern below
Jump into concepts and practice - no test required
When loading data from cloud storage like S3, Azure Blob, or GCS into Snowflake, it's important to understand how the time taken grows as the data size increases.
We want to know how the number of operations changes when we load more files or bigger files.
Analyze the time complexity of the following operation sequence.
COPY INTO my_table
FROM @my_stage
FILE_FORMAT = (TYPE = 'CSV')
PATTERN = '.*\\.csv'
ON_ERROR = 'CONTINUE';
This command loads all CSV files from a cloud storage stage into a Snowflake table.
Identify the API calls, resource provisioning, data transfers that repeat.
- Primary operation: Reading each file from cloud storage (S3, Azure Blob, or GCS) and loading it into Snowflake.
- How many times: Once per file in the stage matching the pattern.
As the number of files increases, Snowflake makes one read operation per file to load the data.
| Input Size (n files) | Approx. API Calls/Operations |
|---|---|
| 10 | About 10 file read operations |
| 100 | About 100 file read operations |
| 1000 | About 1000 file read operations |
Pattern observation: The number of operations grows directly with the number of files to load.
Time Complexity: O(n)
This means the time to load data grows linearly with the number of files being loaded.
[X] Wrong: "Loading many small files is just as fast as loading one big file of the same total size."
[OK] Correct: Each file requires a separate read operation, so many small files cause more overhead and take longer than one big file.
Understanding how loading scales with file count helps you design efficient data pipelines and shows you can think about real-world cloud data workflows.
"What if we compressed all files into one archive file before loading? How would the time complexity change?"
Practice
COPY INTO in Snowflake when loading data from S3, Azure Blob, or GCS?Solution
Step 1: Understand the role of COPY INTO
TheCOPY INTOcommand is used in Snowflake to load data from external cloud storage into Snowflake tables.Step 2: Differentiate from other operations
Exporting data, creating buckets, or deleting files are not done byCOPY INTO. It specifically loads data into tables.Final Answer:
To load data files from cloud storage into Snowflake tables -> Option AQuick Check:
Loading data = COPY INTO [OK]
- Confusing COPY INTO with export commands
- Thinking COPY INTO manages cloud storage buckets
- Assuming COPY INTO deletes files
my_table?Solution
Step 1: Identify correct Snowflake COPY INTO syntax
Snowflake usesCOPY INTO table_name FROM @stage FILE_FORMAT = (TYPE = 'format')to load data.Step 2: Eliminate incorrect options
LOAD DATA INTOis not valid Snowflake syntax. DirectINSERT INTO ... FROM s3://paths are not supported.IMPORT INTOdoes not exist. The correct syntax isCOPY INTO my_table FROM @my_s3_stage FILE_FORMAT = (TYPE = 'CSV');.Final Answer:
COPY INTO my_table FROM @my_s3_stage FILE_FORMAT = (TYPE = 'CSV'); -> Option AQuick Check:
COPY INTO + stage + file format = correct syntax [OK]
- Using LOAD DATA instead of COPY INTO
- Trying to SELECT directly from S3 path
- Using IMPORT INTO which is invalid
COPY INTO sales FROM @azure_blob_stage FILE_FORMAT = (TYPE = 'JSON') ON_ERROR = 'CONTINUE';
What happens if one file in the Azure Blob storage has invalid JSON data?
Solution
Step 1: Understand ON_ERROR = 'CONTINUE'
This option tells Snowflake to skip files or rows with errors and continue loading the rest.Step 2: Apply to invalid JSON file
The invalid JSON file will be skipped, but other valid files will load successfully.Final Answer:
Only the invalid file is skipped, and loading continues for others -> Option CQuick Check:
ON_ERROR = CONTINUE skips errors, loads rest [OK]
- Assuming entire load fails on one bad file
- Thinking Snowflake auto-fixes JSON errors
- Believing invalid data is loaded anyway
COPY INTO customers FROM @gcs_stage FILE_FORMAT = (TYPE = 'CSV');
But you get an error saying 'Storage integration not authorized'. What is the most likely cause?
Solution
Step 1: Analyze the error message
'Storage integration not authorized' means Snowflake cannot access the cloud storage due to permission issues.Step 2: Identify cause
The storage integration must have proper permissions to read from the GCS bucket. Other options do not cause authorization errors.Final Answer:
The storage integration lacks permission to access the GCS bucket -> Option DQuick Check:
Authorization error = permission issue [OK]
- Blaming file format for authorization errors
- Assuming table existence causes storage errors
- Ignoring permission setup for storage integration
2024/. Which COPY INTO command correctly filters these files?Solution
Step 1: Understand file filtering in COPY INTO
Snowflake uses thePATTERNparameter with a regular expression to filter files by name or prefix.Step 2: Check regex correctness
PATTERN = '^2024/.*'matches files starting exactly with '2024/'.PATTERN = '2024/.*'without the ^ may match files where '2024/' appears elsewhere in the path. The other options use invalid parameters like WHERE or FILES.Final Answer:
COPY INTO sales FROM @s3_stage FILE_FORMAT = (TYPE = 'CSV') PATTERN = '^2024/.*'; -> Option BQuick Check:
PATTERN with ^ prefix filters files correctly [OK]
- Omitting ^ in regex causing wrong files to load
- Using WHERE or FILES incorrectly for filtering
- Confusing file prefix with file list
