BigQuery for analytics in GCP - Time & Space Complexity
When using BigQuery for analytics, it is important to understand how query execution time scales as data grows: how much longer does it take to get results when we analyze bigger datasets?
Analyze the time complexity of the following BigQuery SQL query.
```sql
SELECT user_id, COUNT(*) AS session_count
FROM `project.dataset.sessions`
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY user_id
ORDER BY session_count DESC
LIMIT 100
```
This query counts user sessions in January 2024, groups by user, and returns the top 100 users with the most sessions.
Consider which operations repeat during query execution:
- Primary operation: scanning rows in the sessions table that match the date filter.
  - How many times: once over every matching row, so the count depends on how many sessions occurred in January.
- Secondary operation: grouping the filtered rows by user_id and counting sessions per user.
  - How many times: once per unique user in the filtered data.
As the number of sessions in January grows, the query scans more rows and groups more data.
| Input Size (n) | Approximate Work Performed |
|---|---|
| 10,000 sessions | Scan 10,000 rows, group by users in those rows |
| 100,000 sessions | Scan 100,000 rows, group by users in those rows |
| 1,000,000 sessions | Scan 1,000,000 rows, group by users in those rows |
Pattern observation: The work grows roughly in direct proportion to the number of sessions scanned.
Time Complexity: O(n)
The query time grows linearly with the number of rows scanned in the filtered data. Grouping touches each row once, and the final ORDER BY ... LIMIT 100 sorts only the per-user aggregates (one per unique user), so the full scan dominates the cost.
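One way to check this linear relationship in practice is to compare bytes processed across runs against differently sized date ranges. A minimal sketch, assuming your project's jobs run in the `region-us` location and you have permission to read job metadata:

```sql
-- Illustrative check: bytes processed by recent query jobs, to confirm
-- that scanned data (and therefore work) grows with the row count.
SELECT
  creation_time,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC
LIMIT 10;
```

If the query is scanning linearly, doubling the date range should roughly double `total_bytes_processed`.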
[X] Wrong: "Adding a LIMIT 100 means the query only looks at 100 rows, so it runs fast regardless of data size."
[OK] Correct: The LIMIT applies after scanning and grouping all matching rows, so the query still processes all relevant data before limiting results.
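The standard way to actually reduce the work of the date filter is to prune data before the scan with a partitioned table. A sketch using BigQuery's date-partitioning DDL (the table name is assumed from the query above):

```sql
-- Partition the sessions table by event_date so the January filter
-- reads only ~31 daily partitions instead of scanning the whole table.
CREATE TABLE `project.dataset.sessions_partitioned`
PARTITION BY event_date AS
SELECT * FROM `project.dataset.sessions`;
```

The complexity is still O(n), but n becomes "rows in the matching partitions" rather than "rows in the whole table," which is what drives both speed and cost in BigQuery.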
Understanding how BigQuery processes data helps you explain performance trade-offs clearly and shows you can think about cost and speed in cloud analytics.
"What if we added a filter on user_id to only include a small subset? How would the time complexity change?"
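One way to reason about this: the complexity stays linear, but in a much smaller n (only the rows for those users). If the table is also clustered by user_id, BigQuery can skip storage blocks containing none of the requested users. A sketch, with hypothetical user IDs:

```sql
-- Clustering by user_id (illustrative) lets BigQuery skip blocks that
-- contain none of the requested users, shrinking the effective n.
CREATE TABLE `project.dataset.sessions_clustered`
PARTITION BY event_date
CLUSTER BY user_id AS
SELECT * FROM `project.dataset.sessions`;

SELECT user_id, COUNT(*) AS session_count
FROM `project.dataset.sessions_clustered`
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
  AND user_id IN ('u_123', 'u_456')  -- hypothetical IDs
GROUP BY user_id;
```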