MongoDB Query to Find Duplicate Documents Easily
Use $group to group documents by the fields you want to check for duplicates, then filter the groups with $match, keeping only those whose count is greater than 1. For example:

db.collection.aggregate([{ $group: { _id: { field1: "$field1", field2: "$field2" }, count: { $sum: 1 } } }, { $match: { count: { $gt: 1 } } }])

Examples
How to Think About It
First, group documents by the fields of interest using $group. Then count how many documents fall into each group with $sum: 1. Finally, filter the groups with $match, keeping only those where the count is greater than one; these are the duplicates.

Algorithm
Code
db.collection.aggregate([
{ $group: { _id: { field1: "$field1", field2: "$field2" }, count: { $sum: 1 } } },
{ $match: { count: { $gt: 1 } } }
])

Dry Run
Let's trace the pipeline over a small collection whose documents have fields field1 and field2.
Group documents
Group documents by field1 and field2, counting how many times each combination appears.
Filter duplicates
Keep only groups where count is greater than 1. In the grouped output below, only the first two rows survive the $match stage:
| _id | count |
|---|---|
| {field1: 'value1', field2: 'value2'} | 3 |
| {field1: 'value3', field2: 'value4'} | 2 |
| {field1: 'value5', field2: 'value6'} | 1 |
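The same two stages can be sketched in plain JavaScript over an in-memory array. The documents below are hypothetical, chosen to reproduce the dry-run counts above:

```javascript
// Hypothetical documents mirroring the dry run: three value1/value2 copies,
// two value3/value4 copies, and one unique value5/value6 document.
const docs = [
  { field1: 'value1', field2: 'value2' },
  { field1: 'value1', field2: 'value2' },
  { field1: 'value1', field2: 'value2' },
  { field1: 'value3', field2: 'value4' },
  { field1: 'value3', field2: 'value4' },
  { field1: 'value5', field2: 'value6' },
];

// Stage 1: $group — bucket documents by the (field1, field2) pair and count.
const counts = new Map();
for (const d of docs) {
  const key = JSON.stringify({ field1: d.field1, field2: d.field2 });
  counts.set(key, (counts.get(key) || 0) + 1);
}

// Stage 2: $match — keep only groups whose count exceeds 1.
const duplicates = [...counts.entries()]
  .filter(([, count]) => count > 1)
  .map(([key, count]) => ({ _id: JSON.parse(key), count }));

console.log(duplicates);
```

The value5/value6 group is dropped, leaving the two duplicate groups with counts 3 and 2, just as in the table.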
Why This Works
Step 1: Grouping documents
The $group stage groups documents by the specified fields, creating buckets for each unique combination.
Step 2: Counting documents
Inside each group, $sum: 1 counts how many documents belong to that group.
Step 3: Filtering duplicates
The $match stage filters groups to keep only those with a count greater than 1, which means duplicates exist.
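In practice you often want to know which documents are duplicated, not just how many. The $group stage can additionally collect document _ids with a $push accumulator (e.g. ids: { $push: "$_id" }). A minimal plain-JavaScript sketch of that idea, with made-up _id values:

```javascript
// Hypothetical documents: _ids 1 and 2 share the same field1/field2 pair.
const docs = [
  { _id: 1, field1: 'a', field2: 'b' },
  { _id: 2, field1: 'a', field2: 'b' },
  { _id: 3, field1: 'c', field2: 'd' },
];

// In-memory analogue of
// { $group: { _id: {...}, count: { $sum: 1 }, ids: { $push: "$_id" } } }.
const groups = new Map();
for (const d of docs) {
  const key = `${d.field1}|${d.field2}`;
  if (!groups.has(key)) groups.set(key, { count: 0, ids: [] });
  const g = groups.get(key);
  g.count += 1;
  g.ids.push(d._id);
}

// Keep only duplicated groups, as $match: { count: { $gt: 1 } } would,
// then flatten out the _ids of every document in a duplicate group.
const dupIds = [...groups.values()]
  .filter(g => g.count > 1)
  .flatMap(g => g.ids);

console.log(dupIds);
```

Having the _ids in hand is what makes a follow-up cleanup (keeping one copy and removing the rest) possible.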
Alternative Approaches
Distinct + countDocuments:

const values = db.collection.distinct('field1');
values.forEach(val => {
  const count = db.collection.countDocuments({ field1: val });
  if (count > 1) print(val + ' is duplicated');
});
Map-Reduce (deprecated since MongoDB 5.0; prefer the aggregation pipeline):

db.collection.mapReduce(
  function() { emit(this.field1, 1); },
  function(key, values) { return Array.sum(values); },
  { query: {}, out: { inline: 1 } }
);
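To see where the extra cost of the distinct-based approach comes from, here is the same idea in plain JavaScript over hypothetical in-memory data: each distinct value triggers its own full pass over the documents, which is the source of the O(n*m) time below.

```javascript
// Hypothetical data: 'x' appears twice, 'y' once.
const docs = [{ field1: 'x' }, { field1: 'x' }, { field1: 'y' }];

// "distinct": one pass to collect the unique field1 values.
const values = [...new Set(docs.map(d => d.field1))];

// "countDocuments" per value: one full scan of the n documents for each
// of the m distinct values — m scans in total.
const duplicated = values.filter(
  v => docs.filter(d => d.field1 === v).length > 1
);

console.log(duplicated);
```

The aggregation pipeline avoids the repeated scans by counting every group in a single pass.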
Complexity: O(n) time, O(k) space
Time Complexity
The aggregation scans all documents once, so time is proportional to the number of documents n.
Space Complexity
Space depends on the number of unique groups k created by the grouping fields.
Which Approach is Fastest?
For large datasets, a single aggregation pipeline is faster than issuing one countDocuments query per distinct value, and more efficient than Map-Reduce.
| Approach | Time | Space | Best For |
|---|---|---|---|
| Aggregation with $group | O(n) | O(k) | Large datasets, efficient duplicate detection |
| Distinct + countDocuments | O(n*m), m = distinct values | O(m) | Small datasets, simple scripts |
| Map-Reduce | O(n) | O(k) | Complex processing, legacy support |