Overview - DISTINCT for unique values

What is it?

DISTINCT is a SQL keyword that returns only the unique values in a column or combination of columns. It removes duplicate rows from the result of a query, so each unique value (or combination of values) appears only once. This helps when you want to see all the different entries without repeats. It works by comparing the selected rows and filtering out the repeated ones.

Why it matters

Without DISTINCT, queries would return all rows including duplicates, making it hard to understand the variety of data. For example, if you want to know all unique cities where customers live, duplicates would clutter the list. DISTINCT helps clean up results, making data easier to analyze and decisions clearer. It saves time and avoids mistakes caused by repeated data.

Where it fits

Before learning DISTINCT, you should understand basic SQL SELECT queries and how to retrieve data from tables. After DISTINCT, you can learn about GROUP BY for grouping data and aggregate functions like COUNT or SUM. DISTINCT is a foundational tool for data filtering and cleaning in SQL.

Mental Model

Core Idea

DISTINCT filters query results to show only unique rows, removing duplicates.

Think of it like...

Imagine you have a bag of mixed colored marbles and you want to see each color only once. DISTINCT is like picking out one marble of each color and ignoring the rest.

┌───────────────┐
│ Original Data │
│ Red           │
│ Blue          │
│ Red           │
│ Green         │
│ Blue          │
└──────┬────────┘
       │ Apply DISTINCT
       ▼
┌───────────────┐
│ Unique Colors │
│ Red           │
│ Blue          │
│ Green         │
└───────────────┘
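The marble picture can be tried out directly. The sketch below is a minimal illustration using Python's built-in sqlite3 module; the `marbles` table and its `color` column are hypothetical names chosen to mirror the diagram, not part of the original lesson.

```python
import sqlite3

# Hypothetical in-memory table mirroring the marble diagram.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marbles (color TEXT)")
conn.executemany(
    "INSERT INTO marbles (color) VALUES (?)",
    [("Red",), ("Blue",), ("Red",), ("Green",), ("Blue",)],
)

# DISTINCT collapses the five rows to one row per color.
rows = conn.execute("SELECT DISTINCT color FROM marbles").fetchall()

# Compare as a set, because DISTINCT alone promises no particular order.
unique_colors = {color for (color,) in rows}
print(len(rows), unique_colors)
conn.close()
```

The result is compared as a set on purpose: which color comes first is up to the database unless the query adds ORDER BY.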

Build-Up - 7 Steps

1

Foundation: Basic SELECT Query Review

Concept: Understanding how to retrieve data from a table using SELECT.

A SELECT query fetches rows from a table. For example, SELECT city FROM customers; returns all city values, including duplicates.

Result

A list of all city names from the customers table, with repeats.

Knowing how SELECT works is essential before filtering duplicates with DISTINCT.
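To see the repeats for yourself, here is a small sketch with Python's sqlite3 module. The `customers` table comes from the example above, but its `name` column and the sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [
        ("Ana", "Lima"),
        ("Ben", "Oslo"),
        ("Cy", "Lima"),
        ("Di", "Pune"),
        ("Ed", "Oslo"),
    ],
)

# A plain SELECT returns one output row per table row, repeats included.
all_cities = [city for (city,) in conn.execute("SELECT city FROM customers")]
print(all_cities)
conn.close()
```

Five customers produce five city values, even though only three different cities exist.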

2

Foundation: What Causes Duplicate Rows?

3

Intermediate: Using DISTINCT to Remove Duplicates

4

Intermediate: DISTINCT with Multiple Columns

5

Intermediate: DISTINCT vs GROUP BY

6

Advanced: Performance Considerations of DISTINCT

7

Expert: DISTINCT with NULL Values Behavior

Under the Hood

When you use DISTINCT, the database engine compares the rows produced by the SELECT list to find duplicates. It typically sorts the rows or hashes them so that identical rows end up together, then returns only one row from each group. Logically, this happens after the selected columns are computed and before ORDER BY and any row limit are applied.

Why designed this way?

DISTINCT was designed to simplify the common need to find unique values without writing complex code. Sorting or hashing is efficient for grouping duplicates. Alternatives like manual filtering would be slower and more error-prone. The design balances simplicity for users and performance for databases.

┌───────────────┐
│ Query Result  │
│ Row 1         │
│ Row 2         │
│ Row 3         │
│ ...           │
└──────┬────────┘
       │ Sort or Hash
       ▼
┌───────────────┐
│ Grouped Rows  │
│ Unique Row 1  │
│ Unique Row 2  │
│ Unique Row 3  │
└──────┬────────┘
       │ Return
       ▼
┌───────────────┐
│ Final Output  │
│ Unique Rows   │
└───────────────┘
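The two strategies in the diagram, hashing and sorting, can be imitated in a few lines of Python. This is only a conceptual sketch, not how any particular database implements DISTINCT; the function names and sample rows are invented for illustration, and NULL handling is left out.

```python
from itertools import groupby


def distinct_by_hash(rows):
    """Keep the first copy of each row, using a hash set to spot repeats."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def distinct_by_sort(rows):
    """Sort so identical rows sit next to each other, then keep one per run."""
    return [key for key, _ in groupby(sorted(rows))]


rows = [
    ("Lima", "PE"),
    ("Oslo", "NO"),
    ("Lima", "PE"),
    ("Pune", "IN"),
    ("Oslo", "NO"),
]

hashed = distinct_by_hash(rows)
sorted_unique = distinct_by_sort(rows)
print(hashed)
print(sorted_unique)
```

Both approaches find the same unique rows. The sort-based version also returns them in sorted order as a side effect, which is one reason the output order of DISTINCT can differ from the table order.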

Myth Busters - 4 Common Misconceptions

Quick: Does DISTINCT remove duplicates from each column separately or from the entire row? Commit to your answer.

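Common Belief: DISTINCT removes duplicates from each column independently.

Reality: DISTINCT applies to the whole selected row. Two rows count as duplicates only if every selected column matches.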

Quick: Does DISTINCT treat NULL values as different or the same? Commit to your answer.

Common Belief: DISTINCT treats each NULL as a unique value, so multiple NULLs appear in results.

Reality: DISTINCT treats all NULL values as equal, so at most one NULL appears for a column.
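The NULL question is easy to check. The sketch below uses Python's sqlite3 module with a hypothetical `customers` table in which two rows have no city; SQLite is only one engine, but treating NULLs as duplicates of each other under DISTINCT is standard SQL behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ana", "Lima"), ("Ben", None), ("Cy", None), ("Di", "Lima")],
)

# Two NULL cities and two 'Lima' rows go in; DISTINCT keeps one of each.
distinct_cities = conn.execute("SELECT DISTINCT city FROM customers").fetchall()
null_rows = [row for row in distinct_cities if row[0] is None]

# By contrast, the = operator does not consider NULL equal to NULL:
# the comparison itself evaluates to NULL (None in Python).
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

print(distinct_cities, null_equals_null)
conn.close()
```

So DISTINCT groups NULLs together even though an ordinary equality comparison between two NULLs is not true.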

Quick: Is DISTINCT always the fastest way to get unique values? Commit to your answer.

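Common Belief: DISTINCT is always fast and efficient for removing duplicates.

Reality: DISTINCT can be slow on large datasets without suitable indexes, because the engine may have to sort or hash every selected row.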

Quick: Does DISTINCT change the order of rows in the result? Commit to your answer.

Common Belief: DISTINCT preserves the original order of rows in the table.

Reality: DISTINCT does not guarantee any row order. Use ORDER BY to control the output sequence.

Expert Zone

1

DISTINCT can be combined with ORDER BY to control output order, but ORDER BY happens after duplicates are removed.

2

With DISTINCT on multiple columns, a value can still repeat within a single column, because only whole-row combinations are deduplicated.

3

Some databases optimize DISTINCT with indexes, but others may perform full scans, affecting performance.

When NOT to use

Avoid DISTINCT when you need aggregated summaries; use GROUP BY instead. Also, if performance is critical on large datasets, consider indexing or alternative query designs. For filtering duplicates in complex joins, window functions might be better.
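The difference between a unique list and an aggregated summary shows up clearly side by side. This is a minimal sketch with Python's sqlite3 module; the `customers` table and its sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [
        ("Ana", "Lima"),
        ("Ben", "Oslo"),
        ("Cy", "Lima"),
        ("Di", "Pune"),
        ("Ed", "Oslo"),
    ],
)

# DISTINCT answers "which cities exist?"
cities = {city for (city,) in conn.execute("SELECT DISTINCT city FROM customers")}

# GROUP BY answers "how many customers per city?", which DISTINCT cannot.
counts = dict(
    conn.execute("SELECT city, COUNT(*) FROM customers GROUP BY city").fetchall()
)

print(cities)
print(counts)
conn.close()
```

Both queries produce one row per city, but only the GROUP BY version can carry a per-group value such as a count or a sum.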

Production Patterns

In real systems, DISTINCT is often used to populate dropdown lists with unique options, clean data before reporting, or check whether a column that should be unique actually is (for example, by comparing COUNT(*) with COUNT(DISTINCT column)). It is combined with filters and joins to extract meaningful unique sets from large tables.
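One way to run such a uniqueness check is to compare the total row count with the distinct count. The sketch below uses Python's sqlite3 module; the `users` table and `email` column are hypothetical names, and note that COUNT(DISTINCT ...) ignores NULL values, so a real check may need to handle NULLs separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("a@example.com",), ("c@example.com",)],
)

# If every email were unique, the two counts would be equal.
total, distinct_total = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT email) FROM users"
).fetchone()
has_duplicates = total != distinct_total

print(total, distinct_total, has_duplicates)
conn.close()
```

A mismatch between the two counts signals that at least one value is repeated.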

Connections

Set Theory

DISTINCT corresponds to the concept of a set containing unique elements.

By default a query result is a multiset (bag) that may contain duplicates; DISTINCT turns it into a set. Seeing it this way helps you grasp why duplicates are removed and how SQL results relate to mathematical sets.

Data Cleaning

DISTINCT is a basic tool for cleaning data by removing repeated entries.

Knowing how DISTINCT works aids in preparing datasets for analysis by ensuring uniqueness where needed.

Hashing Algorithms

Databases often use hashing internally to detect duplicates efficiently when applying DISTINCT.

Recognizing the role of hashing explains performance characteristics and optimization opportunities.

Common Pitfalls

#1 Expecting DISTINCT to remove duplicates from each column separately.

Wrong approach: SELECT DISTINCT city, state FROM customers; -- expecting unique cities and unique states independently

Correct approach: SELECT DISTINCT city FROM customers; -- to get unique cities only

Root cause: Misunderstanding that DISTINCT works on the combination of all selected columns, not each column individually.
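A short experiment makes this pitfall concrete. The sketch uses Python's sqlite3 module; the `customers` table with `city` and `state` columns follows the queries above, while the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (city TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO customers (city, state) VALUES (?, ?)",
    [
        ("Springfield", "IL"),
        ("Springfield", "MO"),
        ("Springfield", "IL"),
        ("Columbia", "MO"),
    ],
)

# Deduplicates (city, state) pairs: 'Springfield' still appears twice.
pairs = conn.execute("SELECT DISTINCT city, state FROM customers").fetchall()

# Deduplicates on city alone.
cities = conn.execute("SELECT DISTINCT city FROM customers").fetchall()

print(sorted(pairs))
print(sorted(cities))
conn.close()
```

Only the exact repeat of (Springfield, IL) is removed by the two-column query; the city name itself still shows up once per state.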

#2 Using DISTINCT on large tables without indexes, causing slow queries.

Wrong approach: SELECT DISTINCT product_name, category FROM large_products_table; -- run on a large table with no supporting index

Correct approach: Create an index on (product_name, category) before running DISTINCT. GROUP BY on the same indexed columns is an equivalent alternative; whether the index is actually used depends on the database and its query planner.

Root cause: Not considering query performance and database indexing when using DISTINCT.
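The sketch below shows how you might inspect the effect of such an index in SQLite using Python's sqlite3 module. The index name `idx_products_name_category` and the generated rows are assumptions for illustration, and the text of the query plan varies by SQLite version, so it is printed rather than relied upon.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE large_products_table (product_name TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO large_products_table (product_name, category) VALUES (?, ?)",
    [(f"product_{i % 10}", f"category_{i % 3}") for i in range(1000)],
)

query = "SELECT DISTINCT product_name, category FROM large_products_table"

# Result and plan without any index.
before = set(conn.execute(query).fetchall())
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# Hypothetical index covering both selected columns.
conn.execute(
    "CREATE INDEX idx_products_name_category "
    "ON large_products_table (product_name, category)"
)

# Same query again: the answer must not change, only how it is computed.
after = set(conn.execute(query).fetchall())
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(plan_before)
print(plan_after)
conn.close()
```

An index never changes the result of DISTINCT, only the work needed to produce it, so comparing the two plans is the way to confirm whether the index is being used.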

#3 Assuming DISTINCT preserves row order.

Wrong approach: SELECT DISTINCT city FROM customers; -- expecting rows in original table order

Correct approach: SELECT DISTINCT city FROM customers ORDER BY city; -- explicitly ordering results

Root cause: Not knowing that DISTINCT does not guarantee order and that ORDER BY is needed to control output sequence.
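Here is the explicit-ordering version as a runnable sketch with Python's sqlite3 module. The sample rows are invented and deliberately inserted out of alphabetical order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Di", "Pune"), ("Ana", "Lima"), ("Ben", "Oslo"), ("Cy", "Lima")],
)

# ORDER BY runs after duplicates are removed and fixes the output sequence.
ordered_cities = [
    city
    for (city,) in conn.execute("SELECT DISTINCT city FROM customers ORDER BY city")
]
print(ordered_cities)
conn.close()
```

With ORDER BY in place, the result no longer depends on insertion order or on how the engine chose to find the duplicates.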

Key Takeaways

DISTINCT is used to remove duplicate rows from query results, showing only unique values.

It works on the entire set of selected columns combined, not on each column separately.

DISTINCT treats all NULL values as equal, returning only one NULL in results.

Using DISTINCT can impact query performance, especially on large datasets without proper indexing.

Understanding DISTINCT helps clean data, prepare unique lists, and avoid common SQL mistakes.