BigQuery SQL and pricing model in GCP - Time & Space Complexity
We want to understand how the cost and running time of BigQuery SQL queries grow as the amount of data increases.
Specifically, how does scanning more data affect the number of operations performed and the price charged?
Analyze the time complexity of running a simple SELECT query scanning a large table.
```sql
SELECT user_id, COUNT(*) AS visits
FROM `project.dataset.web_logs`
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY user_id
ORDER BY visits DESC
LIMIT 10
```
This query scans web log data for one month, groups by user, and returns the top 10 visitors.
Look at what happens repeatedly during query execution.
- Primary operation: Scanning data blocks from storage.
- How many times: Once for each data block that falls within the queried date range.
- Secondary operations: Grouping and sorting happen after scanning, and their work also scales with the amount of data scanned.
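The pipeline above (scan, then group, then return the top results) can be sketched in Python. This is an illustrative model, not BigQuery's actual execution engine; the tiny `rows` list stands in for the rows of `web_logs` within the date range:

```python
from collections import Counter
import heapq

# Hypothetical stand-in for the web_logs rows matching the date filter.
rows = [("u1",), ("u2",), ("u1",), ("u3",), ("u1",), ("u2",)]

scan_ops = 0
counts = Counter()
for (user_id,) in rows:      # scan: one operation per row read -> O(n)
    scan_ops += 1
    counts[user_id] += 1     # grouping: hash aggregate, O(1) work per row

# ORDER BY visits DESC LIMIT 10: only the top k entries are needed, so a
# heap gives O(g log k) over g groups instead of a full O(g log g) sort.
top10 = heapq.nlargest(10, counts.items(), key=lambda kv: kv[1])

print(scan_ops)    # 6: one operation per scanned row
print(top10[0])    # ('u1', 3): the most frequent visitor
```

Every stage's work is bounded by the number of rows scanned, which is why the scan dominates the analysis.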
As the amount of data scanned grows, the number of data blocks read grows roughly in proportion.
| Input Size (GB scanned) | Approx. Data Blocks Read |
|---|---|
| 10 | 10 blocks |
| 100 | 100 blocks |
| 1000 | 1000 blocks |
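The table's linear pattern can be turned into a simple cost model. The rate below is an assumed illustrative figure, not the current published price; check the GCP pricing page for the actual on-demand rate:

```python
# Illustrative linear model of BigQuery on-demand pricing.
PRICE_PER_TB = 6.25  # USD per TiB scanned -- assumed for illustration only

def estimated_cost_usd(gb_scanned: float) -> float:
    """Cost grows linearly with bytes scanned: O(n)."""
    return gb_scanned / 1024 * PRICE_PER_TB

for gb in (10, 100, 1000):
    print(f"{gb:>5} GB -> ${estimated_cost_usd(gb):.4f}")
```

Scanning 100x the data costs 100x as much: the function is linear in its input, with the price per terabyte as the constant factor.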
Pattern observation: The scanning cost grows linearly with the amount of data scanned.
Time Complexity: O(n)
This means the time and cost increase in direct proportion to the amount of data scanned: double the bytes read, and you roughly double both the work and the bill.
[X] Wrong: "Query cost stays the same no matter how much data is scanned."
[OK] Correct: BigQuery's on-demand pricing bills by the number of bytes a query reads, so scanning more data costs more time and money.
Understanding how query cost grows with data size helps you design efficient queries and manage cloud costs wisely.
What if we added a filter that reduces scanned data by half? How would the time complexity and cost change?
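One way to reason about this question: halving the scanned data halves the constant factor (bytes read, dollars billed), but the growth rate stays O(n). A quick self-contained sketch, using an assumed illustrative price per terabyte:

```python
# Halving the scanned data halves a linear cost, but the complexity
# class is unchanged: cost(n / 2) == cost(n) / 2 for any linear cost(n).
PRICE_PER_TB = 6.25  # assumed illustrative on-demand rate, USD per TiB

def cost(gb: float) -> float:
    return gb / 1024 * PRICE_PER_TB

full = cost(1000)    # cost of scanning the whole month
halved = cost(500)   # cost after a filter cuts the scan in half

print(halved == full / 2)  # True: same O(n), half the constant
```

In practice this is why partitioning and clustering matter: they do not change the complexity class, but they shrink `n`, and with linear pricing a smaller `n` translates directly into a smaller bill.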