Bash Script to Remove Duplicate Lines from File
Use sort -u filename, or a Bash script with awk '!seen[$0]++' filename, to remove duplicate lines from a file.
How to Think About It
sort -u sorts the file and removes duplicates, while awk can track seen lines without sorting.
Code
#!/bin/bash
# Remove duplicate lines from input file
input_file="$1"
if [[ ! -f "$input_file" ]]; then
    echo "File not found: $input_file"
    exit 1
fi
awk '!seen[$0]++' "$input_file"
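As a sketch of how the script might be invoked (the file names /tmp/dedupe.sh and /tmp/sample.txt are illustrative, not from the original):

```shell
#!/bin/bash
# Save the script above under an illustrative name and run it.
cat > /tmp/dedupe.sh <<'EOF'
#!/bin/bash
# Remove duplicate lines from input file
input_file="$1"
if [[ ! -f "$input_file" ]]; then
    echo "File not found: $input_file"
    exit 1
fi
awk '!seen[$0]++' "$input_file"
EOF

# Build a small sample file with one repeated line.
printf 'a\nb\na\n' > /tmp/sample.txt

# Prints 'a' and 'b'; the second 'a' is dropped.
bash /tmp/dedupe.sh /tmp/sample.txt
```

Passing a path that does not exist triggers the guard clause and exits with status 1.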
Dry Run
Let's trace the input lines 'apple', 'banana', 'apple', 'orange', 'banana' through the awk command.
Read first line
Line: 'apple', seen['apple'] is 0, print 'apple', set seen['apple']=1
Read second line
Line: 'banana', seen['banana'] is 0, print 'banana', set seen['banana']=1
Read third line
Line: 'apple', seen['apple'] is 1, skip line
Read fourth line
Line: 'orange', seen['orange'] is 0, print 'orange', set seen['orange']=1
Read fifth line
Line: 'banana', seen['banana'] is 1, skip line
| Line | Seen Before? | Action |
|---|---|---|
| apple | No | Print apple |
| banana | No | Print banana |
| apple | Yes | Skip |
| orange | No | Print orange |
| banana | Yes | Skip |
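The trace above can be checked directly by piping the same five lines through the awk program:

```shell
# Feed the dry-run input to awk; only first occurrences are printed.
printf 'apple\nbanana\napple\norange\nbanana\n' | awk '!seen[$0]++'
# prints:
# apple
# banana
# orange
```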
Why This Works
Step 1: Tracking seen lines
The awk script uses an array seen indexed by the line content to track if a line appeared before.
Step 2: Condition to print
The expression !seen[$0]++ is true only the first time a line is seen: seen[$0] starts at 0 (falsy), so !seen[$0] is true and the line is printed, and the post-increment then marks it as seen for every later occurrence.
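One way to see this is to expand the one-liner into an equivalent, more explicit awk program (a sketch for illustration):

```shell
# Explicit form of '!seen[$0]++': print on first sight, then mark seen.
printf 'x\ny\nx\n' | awk '{ if (seen[$0] == 0) print; seen[$0]++ }'
# prints:
# x
# y
```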
Step 3: No sorting needed
Unlike sort -u, this method preserves the original order of lines while removing duplicates.
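A quick comparison makes the difference visible (the file name is illustrative):

```shell
# awk emits lines in first-occurrence order; sort -u emits sorted order.
printf 'pear\napple\npear\nbanana\n' > /tmp/order_demo.txt

awk '!seen[$0]++' /tmp/order_demo.txt   # pear, apple, banana
sort -u /tmp/order_demo.txt             # apple, banana, pear
```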
Alternative Approaches
sort -u input.txt
sort input.txt | uniq
perl -ne 'print unless $seen{$_}++' input.txt
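The two sort-based alternatives produce identical output, which a quick check confirms (the file name is illustrative):

```shell
# sort -u and sort | uniq should agree line for line.
printf 'b\na\nb\nc\n' > /tmp/alt_demo.txt
first=$(sort -u /tmp/alt_demo.txt)
second=$(sort /tmp/alt_demo.txt | uniq)
[ "$first" = "$second" ] && echo "identical"
```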
Complexity of the sort-based approaches: O(n log n) time, O(n) space
Time Complexity
The awk method runs in O(n) expected time: it processes each line once, with constant-time hashed array lookups. The sort -u method takes O(n log n) due to sorting.
Space Complexity
Both methods use O(n) space: awk keeps every distinct line in memory, while sort stores the data it is sorting (possibly spilling to temporary files).
Which Approach is Fastest?
For large files where order doesn't matter, sort -u is fastest. To keep order, awk is preferred despite slightly more memory use.
| Approach | Time | Space | Best For |
|---|---|---|---|
| awk '!seen[$0]++' | O(n) | O(n) | Removing duplicates while preserving order |
| sort -u | O(n log n) | O(n) | Fast removal when order doesn't matter |
| sort + uniq | O(n log n) | O(n) | Legacy method, requires sorting first |
Use awk '!seen[$0]++' to remove duplicates while keeping the original line order. Using uniq alone only removes consecutive duplicates, missing non-adjacent ones.
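The uniq pitfall is easy to demonstrate: it only collapses adjacent duplicates.

```shell
# The two 'apple' lines are not adjacent, so plain uniq keeps both.
printf 'apple\nbanana\napple\n' | uniq
# prints:
# apple
# banana
# apple
```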