Overview - DISTINCT for unique values

What is it?

DISTINCT is a keyword in SQL used to remove duplicate rows from the results of a query. When you select data from a table, sometimes the same value appears multiple times. Using DISTINCT ensures that each value appears only once in the output. This helps you find unique entries easily without extra work.

Why it matters

Without DISTINCT, queries might return repeated data, making it hard to understand or analyze results. For example, if you want to know all the different cities where customers live, duplicates would clutter the list. DISTINCT solves this by filtering out repeats, saving time and reducing errors in reports or decisions.

Where it fits

Before learning DISTINCT, you should understand basic SQL SELECT queries and how tables store data. After mastering DISTINCT, you can explore GROUP BY for grouping data and aggregate functions like COUNT or SUM to analyze unique groups.

Mental Model

Core Idea

DISTINCT filters query results to show only unique rows, removing duplicates.

Think of it like...

Imagine you have a basket full of apples, oranges, and apples again. DISTINCT is like picking one fruit of each kind to show you only the unique types, ignoring repeats.

┌───────────────┐
│ Query Result  │
│ ┌───────────┐ │
│ │ Apple     │ │
│ │ Orange    │ │
│ │ Apple     │ │
│ │ Banana    │ │
│ │ Orange    │ │
│ └───────────┘ │
│               │
│ DISTINCT applied →
│ ┌───────────┐  
│ │ Apple     │  
│ │ Orange    │  
│ │ Banana    │  
│ └───────────┘  
└───────────────┘

Build-Up - 7 Steps

1

FoundationBasic SELECT query review

Concept: Understanding how to retrieve data from a table using SELECT.

In SQL, SELECT is used to get data from a table. For example, SELECT city FROM customers; will show all city names from the customers table, including repeats.

Result

A list of city names, possibly with duplicates.

Knowing how SELECT works is essential before learning how to filter duplicates with DISTINCT.

2

FoundationWhat duplicates mean in query results

3

IntermediateUsing DISTINCT to remove duplicates

4

IntermediateDISTINCT with multiple columns

5

IntermediateDISTINCT vs GROUP BY comparison

6

AdvancedPerformance considerations with DISTINCT

7

ExpertDISTINCT with NULL values behavior

Under the Hood

When you run a query with DISTINCT, the database engine scans the selected rows and compares them to find duplicates. It often sorts the data or uses a hash structure to detect repeated rows. Once duplicates are identified, only one copy is kept in the output. This process happens after the data is fetched but before the final result is returned.

Why designed this way?

DISTINCT was designed to simplify the common need to find unique values without writing complex code. Sorting or hashing is efficient for duplicate detection. Alternatives like manual filtering would be slower and error-prone. The design balances simplicity for users and performance for databases.

┌───────────────┐
│ Query Input   │
│ SELECT city   │
│ FROM table    │
│ WITH DISTINCT │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Data Fetch    │
│ All rows read │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Duplicate     │
│ Detection     │
│ (Sort/Hash)   │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Unique Rows   │
│ Output Result │
└───────────────┘

Myth Busters - 4 Common Misconceptions

Quick: Does DISTINCT remove duplicates based on one column or all selected columns? Commit to your answer.

Common Belief:DISTINCT removes duplicates based on each column separately.

Tap to reveal reality

Quick: Do multiple NULL values appear multiple times or once with DISTINCT? Commit to your answer.

Common Belief:DISTINCT treats each NULL as unique, so multiple NULLs appear in results.

Tap to reveal reality

Quick: Does DISTINCT always improve query speed? Commit to your answer.

Common Belief:DISTINCT makes queries faster by reducing data.

Tap to reveal reality

Quick: Can DISTINCT be used to count unique values directly? Commit to your answer.

Common Belief:DISTINCT counts unique values by itself.

Tap to reveal reality

Expert Zone

1

DISTINCT applies to the entire row of selected columns, so adding or removing columns changes uniqueness drastically.

2

Indexes on columns used with DISTINCT can improve performance but may not always be used depending on query complexity.

3

DISTINCT interacts with NULL values uniquely, treating all NULLs as one, which differs from some programming languages' handling of nulls.

When NOT to use

Avoid DISTINCT when you need to count or aggregate data; use GROUP BY with aggregate functions instead. Also, for very large datasets where performance is critical, consider data deduplication before querying or use indexed views.

Production Patterns

In real systems, DISTINCT is often used in reporting queries to list unique categories or tags. It is combined with WHERE clauses to filter data before deduplication. Sometimes, developers replace DISTINCT with GROUP BY for better control or performance tuning.

Connections

Set Theory

DISTINCT corresponds to the concept of a set containing unique elements.

Understanding DISTINCT as creating a set helps grasp why duplicates are removed and how SQL results relate to mathematical sets.

Data Deduplication in Storage Systems

Both remove repeated data to save space or improve clarity.

Knowing how storage systems deduplicate data deepens appreciation for DISTINCT's role in cleaning query results.

Filtering in Functional Programming

DISTINCT acts like a filter function that removes repeated items from a list.

Recognizing this connection helps programmers translate SQL concepts into code and vice versa.

Common Pitfalls

#1Expecting DISTINCT to remove duplicates based on a single column when multiple columns are selected.

Wrong approach:SELECT DISTINCT city, country FROM customers; -- expecting unique cities only

Correct approach:SELECT DISTINCT city FROM customers; -- to get unique cities

Root cause:Misunderstanding that DISTINCT applies to the entire row, not individual columns separately.

#2Using DISTINCT to count unique values without COUNT function.

Wrong approach:SELECT DISTINCT city FROM customers; -- expecting a count of unique cities

Correct approach:SELECT COUNT(DISTINCT city) FROM customers; -- correct way to count unique cities

Root cause:Confusing filtering unique rows with counting unique values.

#3Ignoring performance impact of DISTINCT on large tables.

Wrong approach:SELECT DISTINCT city FROM huge_customers_table; -- slow query without indexes

Correct approach:CREATE INDEX idx_city ON huge_customers_table(city); SELECT DISTINCT city FROM huge_customers_table; -- faster query

Root cause:Not considering how DISTINCT works internally and the need for indexing.

Key Takeaways

DISTINCT is used to remove duplicate rows from query results, showing only unique entries.

It applies to the entire set of selected columns, not just one column individually.

DISTINCT treats all NULL values as equal, so only one NULL appears in results.

Using DISTINCT can impact query performance, especially on large datasets without proper indexing.

For counting unique values, combine DISTINCT with COUNT, like COUNT(DISTINCT column).