Bash Script to Find Duplicate Files Quickly
find . -type f -exec md5sum {} + | sort | uniq -w32 -d --all-repeated=separate

This command finds duplicate files by their content hash in the current directory and its subdirectories.

Examples
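As a quick worked example (the path /tmp/dupdemo is just a hypothetical scratch directory), the following creates two identical files and one distinct file, then runs the one-liner:

```shell
# Hypothetical demo directory; any scratch path works.
mkdir -p /tmp/dupdemo
cd /tmp/dupdemo
echo 'hello' > a.txt   # a.txt and b.txt have identical content
echo 'hello' > b.txt
echo 'world' > c.txt   # c.txt is different

# -w32 makes uniq compare only the 32-character MD5 field;
# -d --all-repeated=separate prints every member of each duplicate group.
find . -type f -exec md5sum {} + | sort | uniq -w32 -d --all-repeated=separate
# Prints the a.txt and b.txt lines (same hash); c.txt is omitted.
```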
How to Think About It
Algorithm
Code
find . -type f -exec md5sum {} + | sort | uniq -w32 -d --all-repeated=separate

Dry Run
Let's trace finding duplicates in a directory with files a.txt and b.txt having the same content 'hello'.
Find files
Files found: ./a.txt, ./b.txt
Calculate hashes
md5sum ./a.txt -> 5d41402abc4b2a76b9719d911017c592
md5sum ./b.txt -> 5d41402abc4b2a76b9719d911017c592
Sort and find duplicates
After sorting, entries with identical hashes sit on adjacent lines; uniq -w32 -d then reports every line whose 32-character hash field repeats:
| Hash and File |
|---|
| 5d41402abc4b2a76b9719d911017c592 ./a.txt |
| 5d41402abc4b2a76b9719d911017c592 ./b.txt |
Why This Works
Step 1: Hashing files
Running md5sum computes a 128-bit fingerprint of each file's content; files with identical bytes always produce the same hash, regardless of name or location.
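A minimal sketch of this step (the file paths are made up for illustration):

```shell
# Two files with the same bytes but different names.
printf 'same content\n' > /tmp/copy1
printf 'same content\n' > /tmp/copy2

md5sum /tmp/copy1 /tmp/copy2
# Both lines begin with the same 32-hex-digit digest; the hash
# depends only on the file's bytes, never on its name.
```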
Step 2: Sorting hashes
Sorting groups identical hashes together, making it easier to spot duplicates.
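The grouping effect is easy to see on a toy input, where three-letter "hashes" stand in for real 32-character digests:

```shell
# sort brings lines that share a leading hash field next to each other.
printf 'bbb file2\naaa file1\nbbb file3\n' | sort
# aaa file1
# bbb file2
# bbb file3
```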
Step 3: Filtering duplicates
The uniq -d command shows only hashes that appear more than once, revealing duplicate files.
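On the same toy input, the uniq flags behave like this (-w3 plays the role that -w32 plays for real MD5 output):

```shell
# -w3: compare only the first 3 characters (the stand-in hash field).
# -d --all-repeated=separate: print every line of each duplicate group,
# with a blank line between groups.
printf 'aaa file1\nbbb file2\nbbb file3\n' | uniq -w3 -d --all-repeated=separate
# bbb file2
# bbb file3
```

The unique line `aaa file1` is dropped, while both members of the duplicated `bbb` group are printed.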
Alternative Approaches
find . -type f -exec sha256sum {} + | sort | uniq -w64 -d --all-repeated=separate

fdupes -r .
find . -type f -exec cksum {} + | sort | uniq -w10 -d --all-repeated=separate

(Note: cksum prints a variable-width CRC, so fixed-width matching with -w10 is less robust than matching the fixed-width md5sum or sha256sum output.)

Complexity: O(n log n) time, O(n) space
Time Complexity
Hashing n files takes time proportional to their total content size, roughly O(n) in the number of files. Sorting the n hash lines is O(n log n), which dominates the runtime.
Space Complexity
Storing hashes and filenames requires O(n) space proportional to the number of files.
Which Approach is Fastest?
Using cksum is faster but relies on a 32-bit CRC, so unrelated files can occasionally share a checksum. md5sum balances speed and collision resistance for this task. Dedicated tools like fdupes add features such as byte-for-byte verification and interactive deletion, but may be slower.
| Approach | Time | Space | Best For |
|---|---|---|---|
| md5sum + uniq | O(n log n) | O(n) | Reliable duplicate detection |
| sha256sum + uniq | O(n log n) | O(n) | More secure hash, slower |
| cksum + uniq | O(n log n) | O(n) | Faster but less collision-resistant |
| fdupes tool | Varies | Varies | Feature-rich duplicate management |
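One further speed-up worth knowing about (a common optimization, not part of the comparison above): only files with the same byte size can possibly be duplicates, so hashing can be restricted to sizes that occur more than once. A sketch using GNU find and awk, assuming filenames contain no newlines:

```shell
# 1. List size<TAB>path for every file (GNU find's -printf).
# 2. awk keeps only paths whose size occurs more than once.
# 3. Hash just those candidates, then filter duplicates as before.
find . -type f -printf '%s\t%p\n' |
  awk -F'\t' '{ n[$1]++; paths[$1] = paths[$1] $2 "\n" }
              END { for (s in n) if (n[s] > 1) printf "%s", paths[s] }' |
  xargs -r -d '\n' md5sum |
  sort | uniq -w32 -d --all-repeated=separate
```

On trees with many large, uniquely sized files this avoids most of the hashing work entirely.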
In short: use md5sum with uniq -d to quickly spot duplicate files by content.