How to Use Distinct in PySpark: Remove Duplicate Rows Easily
In PySpark, use the distinct() method on a DataFrame to remove duplicate rows and keep only unique records. It returns a new DataFrame with duplicates removed, comparing rows across all columns.
Syntax
The distinct() method is called on a PySpark DataFrame without any arguments. It returns a new DataFrame containing only unique rows.
DataFrame.distinct(): Removes duplicate rows from the DataFrame.
python
unique_df = df.distinct()
Example
This example shows how to create a PySpark DataFrame with duplicate rows and then use distinct() to get only unique rows. Note that distinct() does not guarantee the order of the resulting rows.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('DistinctExample').getOrCreate()

data = [
    (1, 'apple'),
    (2, 'banana'),
    (1, 'apple'),
    (3, 'orange'),
    (2, 'banana')
]
columns = ['id', 'fruit']
df = spark.createDataFrame(data, columns)

print('Original DataFrame:')
df.show()

unique_df = df.distinct()
print('DataFrame after distinct():')
unique_df.show()

spark.stop()
Output
Original DataFrame:
+---+------+
| id| fruit|
+---+------+
|  1| apple|
|  2|banana|
|  1| apple|
|  3|orange|
|  2|banana|
+---+------+
DataFrame after distinct():
+---+------+
| id| fruit|
+---+------+
|  1| apple|
|  3|orange|
|  2|banana|
+---+------+
Common Pitfalls
One common mistake is expecting distinct() to remove duplicates based on only some columns. It always considers all columns in the DataFrame. To remove duplicates based on specific columns, use dropDuplicates(['col1', 'col2']) instead.
Another pitfall is forgetting that distinct() returns a new DataFrame and does not modify the original one; assign the result to a variable if you want to use the deduplicated data.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('DistinctPitfall').getOrCreate()

data = [
    (1, 'apple', 'red'),
    (1, 'apple', 'green'),
    (2, 'banana', 'yellow'),
    (2, 'banana', 'yellow')
]
columns = ['id', 'fruit', 'color']
df = spark.createDataFrame(data, columns)

# Using distinct() removes duplicates based on all columns
distinct_all = df.distinct()

# Using dropDuplicates on specific columns removes duplicates
# based on those columns only
distinct_some = df.dropDuplicates(['id', 'fruit'])

print('Original DataFrame:')
df.show()

print('After distinct():')
distinct_all.show()

print('After dropDuplicates(["id", "fruit"]):')
distinct_some.show()

spark.stop()
Output
Original DataFrame:
+---+------+------+
| id| fruit| color|
+---+------+------+
|  1| apple|   red|
|  1| apple| green|
|  2|banana|yellow|
|  2|banana|yellow|
+---+------+------+
After distinct():
+---+------+------+
| id| fruit| color|
+---+------+------+
|  1| apple|   red|
|  1| apple| green|
|  2|banana|yellow|
+---+------+------+
After dropDuplicates(["id", "fruit"]):
+---+------+------+
| id| fruit| color|
+---+------+------+
|  1| apple|   red|
|  2|banana|yellow|
+---+------+------+
Quick Reference
- distinct(): Removes duplicate rows considering all columns.
- dropDuplicates(cols): Removes duplicates based on specified columns.
- Both return new DataFrames; original DataFrame stays unchanged.
Key Takeaways
Use distinct() to remove duplicate rows considering all columns in a DataFrame.
distinct() returns a new DataFrame and does not change the original one.
To remove duplicates based on specific columns, use dropDuplicates([columns]).
distinct() is simple and effective for full-row uniqueness.
Always check which columns you want to consider for duplicates before choosing the method.