Bash Script to Remove Duplicate Lines from File
Use sort -u filename, or a Bash script with awk '!seen[$0]++' filename, to remove duplicate lines from a file.
How to Think About It
sort -u sorts the file and removes duplicates, while awk can track seen lines without sorting.
Code
#!/bin/bash
# Remove duplicate lines from input file
input_file="$1"
if [[ ! -f "$input_file" ]]; then
    echo "File not found: $input_file"
    exit 1
fi
awk '!seen[$0]++' "$input_file"
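As a sketch of how the script might be invoked (the file names /tmp/dedupe.sh and /tmp/sample.txt are illustrative, not from the original):

```shell
#!/bin/bash
# Save the script above under an illustrative name and run it.
cat > /tmp/dedupe.sh <<'EOF'
#!/bin/bash
# Remove duplicate lines from input file
input_file="$1"
if [[ ! -f "$input_file" ]]; then
    echo "File not found: $input_file"
    exit 1
fi
awk '!seen[$0]++' "$input_file"
EOF

# Build a small sample file with one repeated line.
printf 'a\nb\na\n' > /tmp/sample.txt

# Prints 'a' and 'b'; the second 'a' is dropped.
bash /tmp/dedupe.sh /tmp/sample.txt
```

Passing a path that does not exist triggers the guard clause and exits with status 1.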
Dry Run
Let's trace the input lines 'apple', 'banana', 'apple', 'orange', 'banana' through the awk command.
Read first line
Line: 'apple', seen['apple'] is 0, print 'apple', set seen['apple']=1
Read second line
Line: 'banana', seen['banana'] is 0, print 'banana', set seen['banana']=1
Read third line
Line: 'apple', seen['apple'] is 1, skip line
Read fourth line
Line: 'orange', seen['orange'] is 0, print 'orange', set seen['orange']=1
Read fifth line
Line: 'banana', seen['banana'] is 1, skip line
| Line | Seen Before? | Action |
|---|---|---|
| apple | No | Print apple |
| banana | No | Print banana |
| apple | Yes | Skip |
| orange | No | Print orange |
| banana | Yes | Skip |
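The trace above can be checked directly by piping the same five lines through the awk program:

```shell
# Feed the dry-run input to awk; only first occurrences are printed.
printf 'apple\nbanana\napple\norange\nbanana\n' | awk '!seen[$0]++'
# prints:
# apple
# banana
# orange
```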
Why This Works
Step 1: Tracking seen lines
The awk script uses an array seen indexed by the line content to track if a line appeared before.
Step 2: Condition to print
The expression !seen[$0]++ is true only the first time a line is seen: seen[$0] starts at 0 (falsy), so !seen[$0] is true and the line is printed, and the post-increment then marks it as seen for every later occurrence.
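One way to see this is to expand the one-liner into an equivalent, more explicit awk program (a sketch for illustration):

```shell
# Explicit form of '!seen[$0]++': print on first sight, then mark seen.
printf 'x\ny\nx\n' | awk '{ if (seen[$0] == 0) print; seen[$0]++ }'
# prints:
# x
# y
```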
Step 3: No sorting needed
Unlike sort -u, this method preserves the original order of lines while removing duplicates.
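A quick comparison makes the difference visible (the file name is illustrative):

```shell
# awk emits lines in first-occurrence order; sort -u emits sorted order.
printf 'pear\napple\npear\nbanana\n' > /tmp/order_demo.txt

awk '!seen[$0]++' /tmp/order_demo.txt   # pear, apple, banana
sort -u /tmp/order_demo.txt             # apple, banana, pear
```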
Alternative Approaches
sort -u input.txt
sort input.txt | uniq
perl -ne 'print unless $seen{$_}++' input.txt
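The two sort-based alternatives produce identical output, which a quick check confirms (the file name is illustrative):

```shell
# sort -u and sort | uniq should agree line for line.
printf 'b\na\nb\nc\n' > /tmp/alt_demo.txt
first=$(sort -u /tmp/alt_demo.txt)
second=$(sort /tmp/alt_demo.txt | uniq)
[ "$first" = "$second" ] && echo "identical"
```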
Complexity of the sort-based approaches: O(n log n) time, O(n) space
Time Complexity
The awk method runs in O(n) expected time: it processes each line once, with constant-time hashed array lookups. The sort -u method takes O(n log n) due to sorting.
Space Complexity
Both methods use O(n) space: awk keeps every distinct line in memory, while sort stores the data it is sorting (possibly spilling to temporary files).
Which Approach is Fastest?
For large files where order doesn't matter, sort -u is fastest. To keep order, awk is preferred despite slightly more memory use.
| Approach | Time | Space | Best For |
|---|---|---|---|
| awk '!seen[$0]++' | O(n) | O(n) | Removing duplicates while preserving order |
| sort -u | O(n log n) | O(n) | Fast removal when order doesn't matter |
| sort + uniq | O(n log n) | O(n) | Legacy method, requires sorting first |
Use awk '!seen[$0]++' to remove duplicates while keeping the original line order. Using uniq alone only removes consecutive duplicates, missing non-adjacent ones.
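The uniq pitfall is easy to demonstrate: it only collapses adjacent duplicates.

```shell
# The two 'apple' lines are not adjacent, so plain uniq keeps both.
printf 'apple\nbanana\napple\n' | uniq
# prints:
# apple
# banana
# apple
```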