Bash scripting · How-To · Beginner · 2 min read

Bash Script to Remove Duplicate Lines from File

Use the command sort -u filename or a Bash script with awk '!seen[$0]++' filename to remove duplicate lines from a file.
📋

Examples

Input
apple
banana
apple
orange
banana

Output
apple
banana
orange

Input
line1
line2
line3
line2
line1
line4

Output
line1
line2
line3
line4
🧠

How to Think About It

To remove duplicate lines, think of reading each line and remembering if you have seen it before. If it's new, keep it; if it's a repeat, skip it. Commands like sort -u sort the file and remove duplicates, while awk can track seen lines without sorting.
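As a quick sketch of that difference (using a hypothetical sample file whose first-seen order is not alphabetical), the two commands can be compared side by side:

```shell
# Sample input whose first-seen order differs from sorted order
printf '%s\n' orange apple banana apple > sample.txt

# sort -u: duplicates removed, but lines come out alphabetically
sort -u sample.txt               # -> apple, banana, orange

# awk: duplicates removed, original first-seen order kept
awk '!seen[$0]++' sample.txt     # -> orange, apple, banana
```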
📐

Algorithm

1
Read the file line by line.
2
Check if the line has been seen before.
3
If not seen, print the line and mark it as seen.
4
If seen, skip the line.
5
Continue until all lines are processed.
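The five steps above can also be sketched in pure Bash with an associative array; this is slower than the awk one-liner but makes the algorithm explicit (requires Bash 4+, and the function name is just an illustration):

```shell
# dedupe_lines: print each distinct line of a file once, in first-seen order
dedupe_lines() {
  local -A seen                          # step 2's memory of lines seen so far
  local line
  while IFS= read -r line; do            # step 1: read line by line
    if [[ -z ${seen["x$line"]} ]]; then  # "x" prefix keeps empty lines as valid keys
      printf '%s\n' "$line"              # step 3: first occurrence -> print...
      seen["x$line"]=1                   #         ...and mark it as seen
    fi                                   # step 4: repeat occurrence -> skip
  done < "$1"                            # step 5: loop ends at end of file
}

printf '%s\n' apple banana apple orange banana > sample.txt
dedupe_lines sample.txt
```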
💻

Code

bash
#!/bin/bash

# Remove duplicate lines from input file
input_file="$1"

if [[ ! -f "$input_file" ]]; then
  echo "File not found: $input_file" >&2
  exit 1
fi

awk '!seen[$0]++' "$input_file"
Output
apple
banana
orange
🔍

Dry Run

Let's trace the input lines 'apple', 'banana', 'apple', 'orange', 'banana' through the awk command.

1

Read first line

Line: 'apple', seen['apple'] is 0, print 'apple', set seen['apple']=1

2

Read second line

Line: 'banana', seen['banana'] is 0, print 'banana', set seen['banana']=1

3

Read third line

Line: 'apple', seen['apple'] is 1, skip line

4

Read fourth line

Line: 'orange', seen['orange'] is 0, print 'orange', set seen['orange']=1

5

Read fifth line

Line: 'banana', seen['banana'] is 1, skip line

Line    Seen Before?  Action
apple   No            Print
banana  No            Print
apple   Yes           Skip
orange  No            Print
banana  Yes           Skip
💡

Why This Works

Step 1: Tracking seen lines

The awk script uses an array seen indexed by the line content to track if a line appeared before.

Step 2: Condition to print

The expression !seen[$0]++ returns true only the first time a line is seen, so it prints unique lines.
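A small sketch can make that visible: for each line, print the pre-increment value of seen[$0] alongside the resulting decision (the sample input here is illustrative):

```shell
printf '%s\n' apple banana apple | awk '{
  v = seen[$0]                              # value before the increment
  act = (!seen[$0]++) ? "print" : "skip"    # true only when v was 0
  printf "%s: seen=%d -> %s\n", $0, v, act
}'
# apple: seen=0 -> print
# banana: seen=0 -> print
# apple: seen=1 -> skip
```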

Step 3: No sorting needed

Unlike sort -u, this method preserves the original order of lines while removing duplicates.

🔄

Alternative Approaches

Using sort command
bash
sort -u input.txt
Simple and fast but sorts lines alphabetically, changing original order.
Using uniq command
bash
sort input.txt | uniq
Removes duplicates but requires sorting first, so original order is lost.
Using grep with a temporary file
bash
: > output.txt   # start with an empty output file
while IFS= read -r line; do
  # Append the line only if it is not already in the output
  grep -qxF -- "$line" output.txt || printf '%s\n' "$line" >> output.txt
done < input.txt
Rescans the output file for every input line, so it is O(n²) and generally not recommended.

Complexity: O(n log n) time, O(n) space

Time Complexity

The awk method runs in O(n) time as it processes each line once. The sort -u method takes O(n log n) due to sorting.

Space Complexity

All of these approaches use O(n) space: sort must hold the data it is sorting, and the awk method stores every distinct line in its seen array in memory.

Which Approach is Fastest?

For large files where order doesn't matter, sort -u is fastest. To keep order, awk is preferred despite slightly more memory use.

Approach             Time        Space  Best For
awk '!seen[$0]++'    O(n)        O(n)   Removing duplicates while preserving order
sort -u              O(n log n)  O(n)   Fast removal when order doesn't matter
sort + uniq          O(n log n)  O(n)   Legacy method; requires sorting first
💡
Use awk '!seen[$0]++' to remove duplicates while keeping original line order.
⚠️
Forgetting to sort before using uniq causes it to only remove consecutive duplicates, missing others.
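A quick sketch of that pitfall: uniq only collapses adjacent duplicates, so unsorted input lets non-adjacent repeats slip through:

```shell
# Without sorting, only the adjacent "a a" run is collapsed;
# the later repeats of a and b survive
printf '%s\n' a b a a b | uniq            # -> a b a b

# Sorting first makes all duplicates adjacent, so uniq removes them all
printf '%s\n' a b a a b | sort | uniq     # -> a b
```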