Bash scripting · How-To · Beginner · 2 min read

Bash Script to Find Duplicate Files Quickly

Use the Bash command find . -type f -exec md5sum {} + | sort | uniq -w32 -d --all-repeated=separate to find duplicate files by their content hash in the current directory and its subdirectories. Note that the -w and --all-repeated options of uniq are GNU coreutils extensions.
📋

Examples

Input: Directory with files: a.txt (content: 'hello'), b.txt (content: 'hello'), c.txt (content: 'world')
Output:
5d41402abc4b2a76b9719d911017c592  a.txt
5d41402abc4b2a76b9719d911017c592  b.txt

Input: Directory with files: file1.txt (content: 'data'), file2.txt (content: 'data'), file3.txt (content: 'data'), file4.txt (content: 'unique')
Output:
8d777f385d3dfec8815d20f7496026dc  file1.txt
8d777f385d3dfec8815d20f7496026dc  file2.txt
8d777f385d3dfec8815d20f7496026dc  file3.txt

Input: Empty directory
Output: (nothing is printed)
🧠

How to Think About It

To find duplicate files, first get a unique fingerprint of each file's content using a hash function like md5sum. Then group files by their hash values. Files sharing the same hash are duplicates. Sorting and filtering these hashes helps identify duplicates easily.
📐

Algorithm

1
Find all files recursively in the target directory.
2
Calculate a hash (md5sum) for each file to represent its content.
3
Sort the list of hashes and filenames to group duplicates together.
4
Use uniq to filter and show only hashes that appear more than once.
5
Print the duplicate files grouped by their hash.
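The five steps above can be sketched as a single commented pipeline. This variant uses find -print0 with xargs -0 so filenames containing spaces are handled safely; it assumes GNU coreutils (md5sum, and uniq's -w / --all-repeated extensions):

```shell
#!/usr/bin/env bash
# Steps 1-5 as one pipeline (GNU coreutils assumed).
find . -type f -print0 |                # 1. list every regular file recursively
  xargs -0 md5sum |                     # 2. hash each file's content
  sort |                                # 3. group identical hashes together
  uniq -w32 -d --all-repeated=separate  # 4-5. print only repeated-hash groups
```

The null-delimited find/xargs pair produces the same output as -exec md5sum {} + but is robust against unusual filenames.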
💻

Code

bash
find . -type f -exec md5sum {} + | sort | uniq -w32 -d --all-repeated=separate
Output
5d41402abc4b2a76b9719d911017c592  ./a.txt
5d41402abc4b2a76b9719d911017c592  ./b.txt
🔍

Dry Run

Let's trace finding duplicates in a directory with files a.txt and b.txt having the same content 'hello'.

1

Find files

Files found: ./a.txt, ./b.txt

2

Calculate hashes

md5sum ./a.txt -> 5d41402abc4b2a76b9719d911017c592
md5sum ./b.txt -> 5d41402abc4b2a76b9719d911017c592

3

Sort and find duplicates

Sorted hashes:
5d41402abc4b2a76b9719d911017c592 ./a.txt
5d41402abc4b2a76b9719d911017c592 ./b.txt
uniq finds the lines that share the same hash.

Hash                              File
5d41402abc4b2a76b9719d911017c592  ./a.txt
5d41402abc4b2a76b9719d911017c592  ./b.txt
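You can reproduce this dry run yourself in a scratch directory (the paths below are illustrative):

```shell
# Recreate the dry-run files in a temporary directory.
dir=$(mktemp -d)
cd "$dir"
printf 'hello' > a.txt   # same content as b.txt -> same md5
printf 'hello' > b.txt
printf 'world' > c.txt   # unique content, filtered out by uniq -d

find . -type f -exec md5sum {} + | sort | uniq -w32 -d --all-repeated=separate
# Lists ./a.txt and ./b.txt under hash 5d41402abc4b2a76b9719d911017c592;
# ./c.txt does not appear.
```

printf (rather than echo) is used so no trailing newline is added, which would change the hash.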
💡

Why This Works

Step 1: Hashing files

Using md5sum creates a 128-bit fingerprint of each file's content: identical files always produce the same hash, and accidental collisions between different files are vanishingly rare.

Step 2: Sorting hashes

Sorting groups identical hashes together, making it easier to spot duplicates.

Step 3: Filtering duplicates

The uniq -w32 -d command compares only the first 32 characters of each line (the md5 digest, ignoring the filename) and keeps only hashes that appear more than once; --all-repeated=separate prints every member of each duplicate group, with groups separated by blank lines.
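The -w32 flag is what makes this step work: md5 digests are exactly 32 hex characters, so uniq compares only the hash field and ignores the filenames that follow. A quick demonstration using the hardcoded hashes from the dry run:

```shell
# uniq -w32 compares only the first 32 characters (the md5 digest),
# so the differing filenames after the hash don't block the match.
printf '%s\n' \
  '5d41402abc4b2a76b9719d911017c592  ./a.txt' \
  '5d41402abc4b2a76b9719d911017c592  ./b.txt' \
  '7d793037a0760186574b0282f2f435e7  ./c.txt' |
  uniq -w32 -d --all-repeated=separate
# Prints the ./a.txt and ./b.txt lines; ./c.txt's hash appears
# only once, so it is dropped.
```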

🔄

Alternative Approaches

Using sha256sum instead of md5sum
bash
find . -type f -exec sha256sum {} + | sort | uniq -w64 -d --all-repeated=separate
sha256sum is more secure and less prone to collisions but slower than md5sum.
Using fdupes tool
bash
fdupes -r .
fdupes is a dedicated tool for finding duplicates with options for deletion, but it requires installation.
Using find with cksum
bash
find . -type f -exec cksum {} + | sort | uniq -w10 -d --all-repeated=separate
cksum computes a CRC-32 checksum, which is faster but collides far more easily than md5sum; note also that cksum's output is not fixed-width, so the -w10 prefix comparison is only approximate.
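One more alternative, sketched here rather than taken from the original one-liner: uniq's fixed-width prefix has to change with the hash tool (-w32 for md5sum, -w64 for sha256sum, and it is unreliable for cksum). A small awk filter avoids that by grouping on the first whitespace-separated field regardless of hash length:

```shell
# Print every line whose hash (field 1) repeats in the sorted stream,
# grouping on the first field instead of a fixed character width.
find . -type f -exec md5sum {} + | sort |
  awk 'prev == $1 { if (!shown) print prevline; print; shown = 1; next }
       { prev = $1; prevline = $0; shown = 0 }'
```

Swapping md5sum for sha256sum or cksum requires no other changes.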

Complexity: O(n log n) time, O(n) space

Time Complexity

Hashing is linear in the total number of bytes read, so calculating hashes for n files is O(n) in the file count (scaled by average file size). Sorting the n hash lines is O(n log n), which dominates the pipeline for large file counts.

Space Complexity

Storing hashes and filenames requires O(n) space proportional to the number of files.

Which Approach is Fastest?

Using cksum is faster but less reliable. md5sum balances speed and accuracy. Dedicated tools like fdupes add features but may be slower.

Approach           Time        Space   Best For
md5sum + uniq      O(n log n)  O(n)    Reliable duplicate detection
sha256sum + uniq   O(n log n)  O(n)    More secure hash, slower
cksum + uniq       O(n log n)  O(n)    Faster but less collision-resistant
fdupes tool        Varies      Varies  Feature-rich duplicate management
💡
Use md5sum with uniq -d to quickly spot duplicate files by content.
⚠️
Beginners often compare filenames instead of file content, missing duplicates with different names.