Apache-spark · How-To · Beginner · 4 min read

How to Use drop() in PySpark DataFrames

In PySpark, drop() removes columns from a DataFrame: pass the column names as strings, e.g. df.drop('col1', 'col2'). To drop rows containing null values, use the related df.na.drop() (or its alias df.dropna()), which supports how, thresh, and subset options.
📝

Syntax

The drop() method appears in two places in the PySpark DataFrame API:

  • Drop columns: df.drop('col1', 'col2') removes the specified columns from the DataFrame.
  • Drop rows with nulls: df.na.drop() (or its alias df.dropna()) removes rows containing null values; called without arguments it drops any row with at least one null.

Parameters for dropping rows include how ('any' or 'all'), thresh (the minimum number of non-null values a row must have to be kept), and subset (the columns to check).

python
df.drop(*cols)
df.na.drop(how='any', thresh=None, subset=None)
💻

Example

This example shows how to drop a column with drop() and how to drop rows with null values using df.na.drop().

python
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.master('local[*]').appName('DropExample').getOrCreate()

# Create sample data
data = [
    Row(name='Alice', age=25, city='New York'),
    Row(name='Bob', age=None, city='Los Angeles'),
    Row(name='Charlie', age=30, city=None)
]

# Create DataFrame
df = spark.createDataFrame(data)

# Drop the 'city' column
df_drop_col = df.drop('city')

# Drop rows with any null values (na.drop, not drop)
df_drop_rows = df.na.drop()

# Show original DataFrame
print('Original DataFrame:')
df.show()

# Show DataFrame after dropping column
print('After dropping column "city":')
df_drop_col.show()

# Show DataFrame after dropping rows with nulls
print('After dropping rows with null values:')
df_drop_rows.show()

spark.stop()
Output
Original DataFrame:
+-------+----+-----------+
|   name| age|       city|
+-------+----+-----------+
|  Alice|  25|   New York|
|    Bob|null|Los Angeles|
|Charlie|  30|       null|
+-------+----+-----------+

After dropping column "city":
+-------+----+
|   name| age|
+-------+----+
|  Alice|  25|
|    Bob|null|
|Charlie|  30|
+-------+----+

After dropping rows with null values:
+-----+---+--------+
| name|age|    city|
+-----+---+--------+
|Alice| 25|New York|
+-----+---+--------+
⚠️

Common Pitfalls

Common mistakes when using drop() include:

  • Trying to drop columns by passing a list instead of separate string arguments (use df.drop('col1', 'col2') or unpack with df.drop(*cols); df.drop(['col1', 'col2']) raises a TypeError).
  • Expecting df.drop() called without arguments to remove rows with nulls — it is a no-op that returns the DataFrame unchanged; use df.na.drop() or df.dropna() to drop null rows.
  • Not specifying the subset parameter when dropping rows with df.na.drop(), which then checks all columns and may drop more rows than intended.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('DropPitfalls').getOrCreate()

data = [(1, 2, 3), (4, None, 6), (7, 8, None)]
columns = ['A', 'B', 'C']
df = spark.createDataFrame(data, columns)

# Wrong: passing a list instead of separate args
# df_wrong = df.drop(['B', 'C'])  # raises a TypeError

# Right: pass column names as separate strings (or unpack a list: df.drop(*cols))
df_right = df.drop('B', 'C')

# Drop rows with nulls only in column 'B' (na.drop, not drop)
df_drop_subset = df.na.drop(subset=['B'])

print('Original DataFrame:')
df.show()
print('After dropping columns B and C:')
df_right.show()
print('After dropping rows with nulls in column B:')
df_drop_subset.show()

spark.stop()
Output
Original DataFrame:
+---+----+----+
|  A|   B|   C|
+---+----+----+
|  1|   2|   3|
|  4|null|   6|
|  7|   8|null|
+---+----+----+

After dropping columns B and C:
+---+
|  A|
+---+
|  1|
|  4|
|  7|
+---+

After dropping rows with nulls in column B:
+---+---+----+
|  A|  B|   C|
+---+---+----+
|  1|  2|   3|
|  7|  8|null|
+---+---+----+
📊

Quick Reference

Summary tips for using drop() in PySpark:

  • Use df.drop('col1', 'col2') to remove columns by name.
  • Use df.na.drop() (or df.dropna()) to remove rows containing any null value.
  • Use df.na.drop(how='all') to drop rows where all checked values are null.
  • Use df.na.drop(thresh=n) to keep only rows with at least n non-null values.
  • Use subset=['col1', 'col2'] to restrict the null check to specific columns when dropping rows.
✅

Key Takeaways

Use drop() with column names as separate strings to remove columns from a DataFrame.
Use df.na.drop() (or df.dropna()) — not df.drop() — to remove rows containing null values.
Specify the subset, how, and thresh parameters to control which rows with nulls get dropped.
Passing a list instead of separate column names to drop() raises a TypeError; unpack the list with df.drop(*cols) instead.
Always check your DataFrame after dropping to confirm the intended rows or columns were removed.