Overview - HashSet for unique elements

What is it?

A HashSet (HashSet<T> in the System.Collections.Generic namespace) is a C# collection that stores unique elements only. It automatically prevents duplicates, so each item appears at most once. You can add, remove, and check for items quickly, in roughly constant time on average. It is useful when you want to keep a collection of values with no repeats.

Why it matters

Without a HashSet, you would have to manually check for duplicates when adding items, which is slow and error-prone. HashSet makes it easy and fast to keep only unique items, saving time and avoiding bugs. This helps in tasks like filtering data, tracking unique users, or managing sets of options.

Where it fits

Before learning HashSet, you should understand basic collections like arrays and lists. After HashSet, you can explore other set operations like intersections and unions, or learn about dictionaries for key-value pairs.

Mental Model

Core Idea

A HashSet is like a special box that only lets you keep one copy of each item, ignoring duplicates automatically.

Think of it like...

Imagine a guest list for a party where each name can only appear once. If someone tries to add the same name again, the list stays the same. The HashSet works like that guest list, ensuring no duplicate names.

HashSet Structure:
┌───────────────┐
│   HashSet     │
│ ┌───────────┐ │
│ │ Unique    │ │
│ │ Elements  │ │
│ └───────────┘ │
│ Add()         │
│ Remove()      │
│ Contains()    │
└───────────────┘
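
The guest-list picture can be sketched with the three operations named in the diagram. This is a minimal illustration, not code from the original lesson; the guest names and variable names are made up for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical guest list: names are illustrative only.
var guests = new HashSet<string>();

bool addedAlice = guests.Add("Alice");     // true: "Alice" was not on the list yet
bool addedAgain = guests.Add("Alice");     // false: already present, the set is unchanged
guests.Add("Bob");

bool hasBob = guests.Contains("Bob");      // true
bool removedBob = guests.Remove("Bob");    // true: "Bob" was found and removed
bool removedTwice = guests.Remove("Bob");  // false: nothing left to remove

Console.WriteLine($"{addedAlice} {addedAgain} {hasBob} {removedBob} {removedTwice} {guests.Count}");
// prints "True False True True False 1"
```

Each method reports what happened through its bool return value, so a caller rarely needs a separate Contains check before Add or Remove.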

Build-Up - 6 Steps

1

Foundation: What is a HashSet in C#

Concept: Introducing the HashSet collection and its purpose.

In C#, HashSet<T> is a generic collection that stores unique elements of type T. When you call Add, it checks whether an equal item already exists. If it does, the set is left unchanged and Add returns false; otherwise the item is stored and Add returns true. This means no duplicates can exist in the HashSet.

Result

You get a collection that automatically filters out repeated items.

Understanding that HashSet enforces uniqueness by design helps you avoid manual duplicate checks.
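
As a small sketch of that filtering behavior, the snippet below removes repeats from a list in two ways: by passing the list to the HashSet<T> constructor, and by counting how many Add calls were rejected. The sample readings are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

// Illustrative input containing repeats.
var readings = new List<int> { 4, 7, 4, 9, 7, 4 };

// The constructor copies the items and silently drops duplicates.
var unique = new HashSet<int>(readings);
Console.WriteLine(unique.Count); // prints 3

// Add returns false for a duplicate, so rejected items can be counted.
var seen = new HashSet<int>();
int rejected = 0;
foreach (int reading in readings)
{
    if (!seen.Add(reading))
    {
        rejected++;
    }
}
Console.WriteLine(rejected); // prints 3
```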

2

Foundation: Basic HashSet operations

3

Intermediate: How HashSet prevents duplicates

4

Intermediate: Set operations with HashSet

5

Advanced: Customizing uniqueness with IEqualityComparer

6

Expert: HashSet internal resizing and performance

Under the Hood

HashSet stores items in buckets selected by each item's hash code. When adding or searching, it calls GetHashCode, maps the result to a bucket, and then uses Equals to compare against the items already in that bucket. When different items land in the same bucket (a collision), they are chained together so that one bucket can hold several items; in .NET the chain is kept as indexes into a single entries array rather than as separate linked-list nodes. When the set runs out of room (the load factor, items per bucket, has grown too high), HashSet allocates a larger bucket array and rehashes all items into it.

Why designed this way?

HashSet was designed to provide very fast membership tests and insertions, unlike lists, which scan items one by one. Hashing gives near-constant-time operations on average. Resizing balances memory use against speed. Alternatives based on balanced trees, such as SortedSet<T>, keep items sorted but need O(log n) time per operation, so they are slower for simple uniqueness checks. HashSet's design is a tradeoff optimized for fast average-case performance.

HashSet Internal Structure:

[Item] --hash--> [Bucket Array]
┌───────────────┐
│ Bucket 0      │ -> Item A
│ Bucket 1      │ -> Item B -> Item C (collision)
│ Bucket 2      │ -> empty
│ ...           │
└───────────────┘

Resize triggers when buckets fill up:
Old Buckets -> New Larger Buckets
Rehash all items to new buckets
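
The bucket-and-resize idea can be sketched with a toy model. This is deliberately simplified and is not how the real HashSet<T> is laid out internally: the names ToyAdd, ToyContains, and Resize are hypothetical, each bucket is a plain List<string>, and the table doubles once it holds as many items as it has buckets.

```csharp
using System;
using System.Collections.Generic;

// Toy model only: NOT the real HashSet<T> implementation.
var buckets = new List<string>[4]; // each slot is null until first used
int count = 0;

// Map an item's hash code to a bucket index (mask keeps the value non-negative).
int BucketIndex(string item, int bucketCount) =>
    (item.GetHashCode() & 0x7FFFFFFF) % bucketCount;

bool ToyContains(string item)
{
    var chain = buckets[BucketIndex(item, buckets.Length)];
    return chain != null && chain.Contains(item); // Equals check inside the bucket
}

void Resize()
{
    var old = buckets;
    buckets = new List<string>[old.Length * 2];
    foreach (var chain in old)
    {
        if (chain == null) continue;
        foreach (var item in chain)
        {
            // Rehash: the same item may land in a different bucket now.
            int i = BucketIndex(item, buckets.Length);
            (buckets[i] ??= new List<string>()).Add(item);
        }
    }
}

bool ToyAdd(string item)
{
    if (ToyContains(item)) return false;      // duplicate: leave the table unchanged
    if (count >= buckets.Length) Resize();    // table is full: grow before inserting
    int i = BucketIndex(item, buckets.Length);
    (buckets[i] ??= new List<string>()).Add(item);
    count++;
    return true;
}

Console.WriteLine(ToyAdd("apple")); // prints True
Console.WriteLine(ToyAdd("apple")); // prints False
foreach (var fruit in new[] { "pear", "plum", "fig", "kiwi", "lime" })
{
    ToyAdd(fruit);
}
Console.WriteLine($"{count} items in {buckets.Length} buckets"); // prints "6 items in 8 buckets"
Console.WriteLine(ToyContains("fig")); // prints True
```

Which bucket each string lands in varies between runs, because string hash codes are randomized per process in modern .NET, but membership results stay the same.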

Myth Busters - 4 Common Misconceptions

Quick: Does HashSet preserve the order of items added? Commit to yes or no.

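Common Belief: HashSet keeps items in the order you add them.

Reality: HashSet makes no guarantee about enumeration order. Items may happen to come back in insertion order, but code must not rely on it.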

Quick: Can HashSet store multiple identical items if you add them repeatedly? Commit to yes or no.

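Common Belief: HashSet allows duplicates if you add the same item multiple times.

Reality: Add returns false for an item that is already present and leaves the set unchanged, so each item is stored at most once.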

Quick: Does HashSet use the object's memory address to check uniqueness? Commit to yes or no.

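Common Belief: HashSet checks uniqueness by comparing memory addresses of objects.

Reality: HashSet decides uniqueness with GetHashCode and Equals, or with the IEqualityComparer passed to its constructor. Only types that do not override these methods, which is the default for classes, fall back to reference equality.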

Quick: Is HashSet always faster than a List for all operations? Commit to yes or no.

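Common Belief: HashSet is always faster than List for any operation.

Reality: HashSet is faster for membership checks on larger collections, but a List supports index access and ordered iteration, and for very small collections a simple linear scan can be just as fast.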
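
The third misconception (memory addresses) is easy to test. In this sketch, two separate string instances with the same characters count as one element, while two plain object instances, which do not override Equals, count as two. The values are arbitrary examples.

```csharp
using System;
using System.Collections.Generic;

string a = "guest";
string b = new string(new[] { 'g', 'u', 'e', 's', 't' }); // separate instance, same characters

Console.WriteLine(object.ReferenceEquals(a, b)); // prints False: two different objects

var names = new HashSet<string>();
bool firstAdded = names.Add(a);   // true
bool secondAdded = names.Add(b);  // false: string equality compares characters, not addresses
Console.WriteLine($"{firstAdded} {secondAdded} {names.Count}"); // prints "True False 1"

// System.Object does not override Equals, so each instance is its own element.
var objects = new HashSet<object>();
objects.Add(new object());
objects.Add(new object());
Console.WriteLine(objects.Count); // prints 2
```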

Expert Zone

1

HashSet's performance depends heavily on the quality of the hash function; poor hash functions cause many collisions and slow operations.

2

When using mutable objects as elements, changing the state that feeds GetHashCode or Equals after adding them to a HashSet can break uniqueness guarantees and cause hard-to-find bugs.

3

Pre-sizing a HashSet with the expected number of elements reduces costly resizing and improves performance in large data scenarios.
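
Pre-sizing, the third point above, might look like the following sketch. The figure of 100,000 elements is an assumed workload, not a number from the original text. The capacity constructor and EnsureCapacity are available in current .NET; older .NET Framework versions may lack one or both.

```csharp
using System;
using System.Collections.Generic;

const int expected = 100_000; // assumed number of elements, for illustration

// Reserve room up front so the fill loop does not trigger repeated resizes.
var ids = new HashSet<int>(expected);
for (int i = 0; i < expected; i++)
{
    ids.Add(i);
}
Console.WriteLine(ids.Count); // prints 100000

// An existing set can be grown ahead of a bulk insert.
var later = new HashSet<int>();
int capacity = later.EnsureCapacity(expected); // returns the capacity now available
Console.WriteLine(capacity >= expected);       // prints True

// After the set stops growing, unused space can be released.
later.TrimExcess();
```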

When NOT to use

Avoid HashSet when you need to preserve insertion order; use a List<T> instead, optionally paired with a HashSet<T> for duplicate checks. If you need sorted order, use SortedSet<T>. If you need key-value pairs, use Dictionary. For small collections where performance is not critical, a List with manual checks might be simpler.
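
One way to keep insertion order and still reject duplicates is to pair a List<T> with a HashSet<T>, as in this sketch. The tag values are invented; SortedSet<T> is included only to contrast sorted order with insertion order.

```csharp
using System;
using System.Collections.Generic;

var ordered = new List<string>();  // remembers insertion order
var seen = new HashSet<string>();  // answers "have we seen this?" quickly

foreach (var tag in new[] { "red", "blue", "red", "green", "blue" })
{
    // Add returns true only the first time a value appears.
    if (seen.Add(tag))
    {
        ordered.Add(tag);
    }
}
Console.WriteLine(string.Join(", ", ordered)); // prints "red, blue, green"

// SortedSet<T> keeps unique items in sorted order, not insertion order.
var sorted = new SortedSet<int> { 3, 1, 2, 3 };
Console.WriteLine(string.Join(", ", sorted)); // prints "1, 2, 3"
```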

Production Patterns

In real systems, HashSet is used for filtering duplicates from large data streams, managing unique user IDs, implementing fast lookups in caching layers, and performing set operations in algorithms like graph traversal or recommendation engines.
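
As one concrete instance of these patterns, a breadth-first graph traversal can use a HashSet as its visited set. The four-node graph below is a made-up example, and the adjacency dictionary is just one possible representation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical directed graph with a cycle (D points back to A).
var graph = new Dictionary<string, string[]>
{
    ["A"] = new[] { "B", "C" },
    ["B"] = new[] { "D" },
    ["C"] = new[] { "D" },
    ["D"] = new[] { "A" },
};

var visited = new HashSet<string> { "A" };
var queue = new Queue<string>();
queue.Enqueue("A");
var visitOrder = new List<string>();

while (queue.Count > 0)
{
    var node = queue.Dequeue();
    visitOrder.Add(node);

    foreach (var next in graph[node])
    {
        // Add doubles as the "already visited?" test, so each node is queued once.
        if (visited.Add(next))
        {
            queue.Enqueue(next);
        }
    }
}

Console.WriteLine(string.Join(" ", visitOrder)); // prints "A B C D"
```

Without the visited set, the cycle from D back to A would keep the loop running forever.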

Connections

Dictionary

HashSet is like a Dictionary without values, storing only keys uniquely.

Understanding HashSet helps grasp how Dictionary manages keys and values efficiently.

Mathematical Set Theory

HashSet implements the concept of a mathematical set with unique elements and set operations.

Knowing set theory clarifies why operations like union and intersection behave as they do in HashSet.
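
The correspondence with set theory can be sketched directly. The HashSet<T> methods below modify the set they are called on, so each result starts from a copy; the two sample sets are arbitrary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var a = new HashSet<int> { 1, 2, 3, 4 };
var b = new HashSet<int> { 3, 4, 5 };

var union = new HashSet<int>(a);
union.UnionWith(b);                    // A ∪ B

var intersection = new HashSet<int>(a);
intersection.IntersectWith(b);         // A ∩ B

var difference = new HashSet<int>(a);
difference.ExceptWith(b);              // A \ B

var symmetric = new HashSet<int>(a);
symmetric.SymmetricExceptWith(b);      // elements in exactly one of A, B

// Sort before printing, because HashSet enumeration order is not guaranteed.
string Show(IEnumerable<int> items) => string.Join(",", items.OrderBy(x => x));

Console.WriteLine(Show(union));        // prints "1,2,3,4,5"
Console.WriteLine(Show(intersection)); // prints "3,4"
Console.WriteLine(Show(difference));   // prints "1,2"
Console.WriteLine(Show(symmetric));    // prints "1,2,5"
Console.WriteLine(intersection.IsSubsetOf(a));           // prints True
Console.WriteLine(union.SetEquals(new[] { 5, 4, 3, 2, 1 })); // prints True
```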

Database Indexing

HashSet's hashing mechanism is similar to how database indexes quickly find records.

Recognizing this connection helps understand performance optimization in both programming and databases.

Common Pitfalls

#1 Assuming HashSet preserves the order of added items.

Wrong approach:
    var set = new HashSet<int>();
    set.Add(3); set.Add(1); set.Add(2);
    foreach (var item in set) { Console.WriteLine(item); } // expects 3, 1, 2, but the order is not guaranteed

Correct approach:
    var list = new List<int> { 3, 1, 2 };
    foreach (var item in list) { Console.WriteLine(item); } // preserves order

Root cause: Not realizing that HashSet is unordered and does not track insertion sequence.

#2 Using mutable objects as HashSet elements and modifying them after insertion.

Wrong approach:
    class Person
    {
        public string Name;
        public override int GetHashCode() => Name.GetHashCode();
        // Pattern match instead of a cast, so null or non-Person arguments return false rather than throwing
        public override bool Equals(object obj) => obj is Person other && other.Name == Name;
    }

    var set = new HashSet<Person>();
    var p = new Person { Name = "Alice" };
    set.Add(p);
    p.Name = "Bob"; // changes the hash code while p is still stored under the old one
    bool contains = set.Contains(p); // returns false unexpectedly, even though p is in the set

Correct approach: Use immutable objects, or avoid changing properties used in GetHashCode and Equals after adding to HashSet.

Root cause: Changing object state breaks the hash code and equality contract required by HashSet.
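
One possible way to follow the immutable-object advice is a positional record, sketched below under the assumption that C# 9 or later is available. A record named Person is used here purely for illustration; its Name property is init-only and its equality is value-based, so an element cannot drift away from the hash code it was stored under.

```csharp
using System;
using System.Collections.Generic;

var people = new HashSet<Person>();
var alice = new Person("Alice");
people.Add(alice);

// alice.Name = "Bob"; // would not compile: positional record properties are init-only

// "Renaming" produces a new object and leaves the stored element untouched.
var renamed = alice with { Name = "Bob" };

Console.WriteLine(people.Contains(alice));               // prints True
Console.WriteLine(people.Contains(new Person("Alice"))); // prints True: records compare by value
Console.WriteLine(people.Contains(renamed));             // prints False: "Bob" was never added

// Illustrative type: a positional record with value equality and an init-only property.
record Person(string Name);
```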

#3 Not providing a custom comparer when needed, causing unexpected duplicates.

Wrong approach:
    var set = new HashSet<string>();
    set.Add("apple");
    set.Add("APPLE"); // both are added: the default string comparer is case-sensitive

Correct approach:
    var set = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    set.Add("apple");
    set.Add("APPLE"); // second is ignored as a duplicate

Root cause: Ignoring case sensitivity or custom equality needs leads to logical duplicates.

Key Takeaways

HashSet is a collection that automatically keeps only unique elements, preventing duplicates.

It uses hashing to quickly add, remove, and check items, making it faster than lists for uniqueness tasks.

HashSet does not preserve the order of items; it focuses on uniqueness and speed.

Custom equality comparers let you define what 'unique' means for your data types.

Understanding HashSet internals helps avoid common bugs and optimize performance in real applications.