How to Use Distinct in PySpark: Remove Duplicate Rows Easily
In PySpark, use the distinct() method on a DataFrame to remove duplicate rows and keep only unique records. It returns a new DataFrame with duplicates removed, comparing rows across all columns.
Syntax
The distinct() method is called on a PySpark DataFrame without any arguments. It returns a new DataFrame containing only unique rows.
DataFrame.distinct(): Removes duplicate rows from the DataFrame.
python
unique_df = df.distinct()
Example
This example shows how to create a PySpark DataFrame with duplicate rows and then use distinct() to get only unique rows. Note that distinct() does not guarantee the order of the resulting rows.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('DistinctExample').getOrCreate()

data = [
    (1, 'apple'),
    (2, 'banana'),
    (1, 'apple'),
    (3, 'orange'),
    (2, 'banana')
]
columns = ['id', 'fruit']
df = spark.createDataFrame(data, columns)

print('Original DataFrame:')
df.show()

unique_df = df.distinct()
print('DataFrame after distinct():')
unique_df.show()

spark.stop()
Output
Original DataFrame:
+---+------+
| id| fruit|
+---+------+
|  1| apple|
|  2|banana|
|  1| apple|
|  3|orange|
|  2|banana|
+---+------+
DataFrame after distinct():
+---+------+
| id| fruit|
+---+------+
|  1| apple|
|  3|orange|
|  2|banana|
+---+------+
Common Pitfalls
One common mistake is expecting distinct() to remove duplicates based on only some columns. It always considers all columns in the DataFrame. To remove duplicates based on specific columns, use dropDuplicates(['col1', 'col2']) instead.
Another pitfall is forgetting that distinct() returns a new DataFrame and does not modify the original one; assign the result to a variable if you want to use the deduplicated data.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('DistinctPitfall').getOrCreate()

data = [
    (1, 'apple', 'red'),
    (1, 'apple', 'green'),
    (2, 'banana', 'yellow'),
    (2, 'banana', 'yellow')
]
columns = ['id', 'fruit', 'color']
df = spark.createDataFrame(data, columns)

# Using distinct() removes duplicates based on all columns
distinct_all = df.distinct()

# Using dropDuplicates on specific columns removes duplicates
# based on those columns only
distinct_some = df.dropDuplicates(['id', 'fruit'])

print('Original DataFrame:')
df.show()

print('After distinct():')
distinct_all.show()

print('After dropDuplicates(["id", "fruit"]):')
distinct_some.show()

spark.stop()
Output
Original DataFrame:
+---+------+------+
| id| fruit| color|
+---+------+------+
|  1| apple|   red|
|  1| apple| green|
|  2|banana|yellow|
|  2|banana|yellow|
+---+------+------+
After distinct():
+---+------+------+
| id| fruit| color|
+---+------+------+
|  1| apple|   red|
|  1| apple| green|
|  2|banana|yellow|
+---+------+------+
After dropDuplicates(["id", "fruit"]):
+---+------+------+
| id| fruit| color|
+---+------+------+
|  1| apple|   red|
|  2|banana|yellow|
+---+------+------+
Quick Reference
- distinct(): Removes duplicate rows considering all columns.
- dropDuplicates(cols): Removes duplicates based on specified columns.
- Both return new DataFrames; original DataFrame stays unchanged.
Key Takeaways
Use distinct() to remove duplicate rows considering all columns in a DataFrame.
distinct() returns a new DataFrame and does not change the original one.
To remove duplicates based on specific columns, use dropDuplicates([columns]).
distinct() is simple and effective for full-row uniqueness.
Always check which columns you want to consider for duplicates before choosing the method.