Row Key Design Strategies in HBase on Hadoop - Time & Space Complexity
When working with HBase, the distributed table store of the Hadoop ecosystem, the way we design row keys determines how quickly data can be located and stored.
We want to understand how the choice of row keys changes the amount of work done as the data grows.
Analyze the time complexity of this simple row-key range scan.
```java
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

// Scan every row whose key falls in [startKey, stopKey).
Scan scan = new Scan();
scan.setStartRow(startKey); // inclusive lower bound
scan.setStopRow(stopKey);   // exclusive upper bound
try (ResultScanner scanner = table.getScanner(scan)) {
    for (Result result : scanner) {
        // process each row
    }
}
```
This code scans rows between two keys and processes each row found.
Look at what repeats as data grows.
- Primary operation: Scanning rows between startKey and stopKey.
- How many times: Once per row in the key range, which depends on how keys are distributed.
As the number of rows between startKey and stopKey grows, the scan takes longer.
| Input Size (rows scanned) | Approx. Operations |
|---|---|
| 10 | 10 row reads |
| 100 | 100 row reads |
| 1000 | 1000 row reads |
Pattern observation: The work grows directly with how many rows the scan covers.
Time Complexity: O(n)
This means scan time grows linearly with the number of rows the scan covers.
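The linear pattern above can be sketched in plain Java, using a sorted map as a stand-in for HBase's lexicographically sorted key space (`subMap` plays the role of a range scan here; the class and method names are illustrative, not HBase API):

```java
import java.util.TreeMap;

public class RangeScanCost {
    // Simulate a range scan over [startKey, stopKey) against `total`
    // sequentially keyed rows; TreeMap stands in for HBase's sorted key space.
    static int rowsTouched(int total, int startKey, int stopKey) {
        TreeMap<Integer, String> rows = new TreeMap<>();
        for (int i = 0; i < total; i++) rows.put(i, "row-" + i);
        return rows.subMap(startKey, stopKey).size();
    }

    public static void main(String[] args) {
        // Doubling the key range doubles the rows read: O(n) in range size.
        System.out.println(rowsTouched(1000, 100, 200)); // 100
        System.out.println(rowsTouched(1000, 100, 300)); // 200
    }
}
```

Note that the cost depends on the number of rows inside the range, not on the total table size: widening the range from 100 keys to 200 keys doubles the work.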
[X] Wrong: "Choosing any row key design will give the same scan speed."
[OK] Correct: Poorly designed row keys can cluster data badly (for example, hotspotting a single region with sequential writes), forcing scans to read many unwanted rows and slowing them down.
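One common remedy for clustering is key salting: prefixing each key with a small, deterministic bucket number so that sequential keys fan out across regions. A minimal sketch in plain Java (the bucket count and separator are illustrative choices, not HBase API):

```java
public class SaltedKeys {
    // Hypothetical bucket count; in practice, tuned to the number of regions.
    static final int BUCKETS = 8;

    // Deterministically salt a key so monotonically increasing keys
    // (timestamps, sequence numbers) spread across BUCKETS prefixes.
    static String salt(String key) {
        int bucket = Math.floorMod(key.hashCode(), BUCKETS);
        return bucket + "|" + key;
    }

    public static void main(String[] args) {
        for (String k : new String[]{"2024-01-01#evt1", "2024-01-01#evt2", "2024-01-01#evt3"}) {
            System.out.println(salt(k));
        }
    }
}
```

The trade-off: a range scan over salted keys must issue one scan per bucket prefix, so salting exchanges scan convenience for even write distribution.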
Understanding how row key design affects scan time shows you know how data layout impacts performance in big data systems.
"What if we changed the row key to include a timestamp prefix? How would that affect the scan time complexity?"
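As a starting point for that question: with a raw timestamp prefix, keys sort oldest-first and every new write lands at the tail of the key space, while a scan remains O(n) in the rows its range covers. A common variation, sketched here in plain Java with illustrative names, reverses the timestamp so the newest rows sort first:

```java
import java.util.TreeMap;

public class TimestampPrefixKeys {
    // Zero-padded reversed timestamp: newest events get the smallest key,
    // so "latest N events" becomes a short scan from the start of the range.
    static String key(long tsMillis, String id) {
        return String.format("%019d#%s", Long.MAX_VALUE - tsMillis, id);
    }

    public static void main(String[] args) {
        TreeMap<String, String> rows = new TreeMap<>(); // stand-in for sorted row keys
        rows.put(key(1_000L, "a"), "older event");
        rows.put(key(2_000L, "b"), "newest event");
        // The most recent write now sorts first:
        System.out.println(rows.firstEntry().getValue()); // newest event
    }
}
```

Either way the scan itself stays O(n) in rows covered; what the prefix changes is which rows a given range covers, and therefore how small you can make n for your query pattern.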