sort and uniq in Bash Pipelines - Time & Space Complexity
When using sort and uniq together in a pipeline, it is important to understand how the time taken grows as the input size increases.
We want to know how the commands handle larger lists of data and how their work adds up.
Analyze the time complexity of the following code snippet.
```shell
sort input.txt | uniq > output.txt
```
This code sorts the lines of input.txt and then removes duplicate lines, saving the result to output.txt.
Identify the loops, recursion, and array traversals that repeat.
- Primary operation: Sorting all lines in the input file.
- How many times: The sorting process compares and rearranges lines multiple times, depending on input size.
- Secondary operation: The uniq command scans the sorted output once to remove duplicates.
- How many times: It reads through the sorted lines one by one.
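The two operations above can be seen on a small sample. The file name and contents here are hypothetical, chosen just to show the pipeline's behavior:

```shell
# Create a small sample file with scattered duplicate lines.
printf 'banana\napple\nbanana\ncherry\napple\n' > input.txt

# Sort brings duplicate lines next to each other (the O(n log n) step);
# uniq then makes a single linear pass and drops adjacent repeats.
sort input.txt | uniq > output.txt

cat output.txt

# sort -u combines both steps in one command and produces the same result.
sort -u input.txt
```

Using sort -u avoids spawning a second process, though the overall complexity is the same: the sort still dominates.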
As the number of lines grows, sorting takes more time because it must compare and reorder many lines.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 30 to 50 comparisons and rearrangements |
| 100 | About 700 to 1,000 comparisons and rearrangements |
| 1000 | About 10,000 to 15,000 comparisons and rearrangements |
Pattern observation: The sorting work grows faster than the input size: when the input grows tenfold, the number of comparisons grows by more than tenfold. The uniq step grows linearly, reading each line exactly once.
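The table's figures can be approximated with the n * log2(n) model of comparison sorting. This is a rough estimate only; real sort implementations vary by algorithm and by how ordered the input already is:

```shell
# Estimate comparison counts as n * log2(n) for the input sizes in the table.
# awk has natural log only, so log2(n) is computed as log(n) / log(2).
for n in 10 100 1000; do
  awk -v n="$n" 'BEGIN { printf "n=%d  ~%d comparisons\n", n, n * log(n) / log(2) }'
done
```

For n = 10 this gives about 33 comparisons, and for n = 1000 about 9,965 — the same order of magnitude as the table's ranges.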
Time Complexity: O(n log n)
This means the total time grows slightly faster than linearly in the number of lines: the sort step dominates, while the uniq pass adds only linear work.
[X] Wrong: "The uniq command alone removes duplicates from any list quickly without sorting first."
[OK] Correct: uniq only removes duplicates if they are next to each other. Without sorting first, duplicates scattered in the file won't be removed.
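This adjacency requirement is easy to demonstrate. The sample file below is hypothetical; the point is that uniq alone leaves scattered duplicates untouched:

```shell
# Duplicates are scattered, not adjacent.
printf 'apple\nbanana\napple\nbanana\n' > fruits.txt

# uniq only collapses consecutive identical lines,
# so all four lines survive here.
uniq fruits.txt

# Sorting first groups the duplicates, so uniq can remove them.
sort fruits.txt | uniq
```

The first command prints all four lines; the second prints only two.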
Understanding how sorting and filtering commands scale helps you reason about data processing in real scripts. This skill shows you can think about efficiency when working with pipelines.
What if we replaced sort with a command that only partially sorts the data? How would the time complexity change?