Apache-spark · How-To · Beginner · 4 min read

How to Use drop() in PySpark DataFrames

In PySpark, drop() removes columns from a DataFrame: pass the column names as strings, e.g. df.drop('col1', 'col2'). To drop rows containing null values, use the related df.na.drop() (or its alias df.dropna()), which supports how, thresh, and subset options.
📝

Syntax

The drop() method appears in two places in the PySpark DataFrame API:

  • Drop columns: df.drop('col1', 'col2') removes the specified columns from the DataFrame.
  • Drop rows with nulls: df.na.drop() (or its alias df.dropna()) removes rows containing null values; called without arguments it drops any row with at least one null.

Parameters for dropping rows include how ('any' or 'all'), thresh (the minimum number of non-null values a row must have to be kept), and subset (the columns to check).

python
df.drop(*cols)
df.na.drop(how='any', thresh=None, subset=None)
💻

Example

This example shows how to drop a column with drop() and how to drop rows with null values using df.na.drop().

python
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.master('local[*]').appName('DropExample').getOrCreate()

# Create sample data
data = [
    Row(name='Alice', age=25, city='New York'),
    Row(name='Bob', age=None, city='Los Angeles'),
    Row(name='Charlie', age=30, city=None)
]

# Create DataFrame
df = spark.createDataFrame(data)

# Drop the 'city' column
df_drop_col = df.drop('city')

# Drop rows with any null values (na.drop, not drop)
df_drop_rows = df.na.drop()

# Show original DataFrame
print('Original DataFrame:')
df.show()

# Show DataFrame after dropping column
print('After dropping column "city":')
df_drop_col.show()

# Show DataFrame after dropping rows with nulls
print('After dropping rows with null values:')
df_drop_rows.show()

spark.stop()
Output
Original DataFrame:
+-------+----+-----------+
|   name| age|       city|
+-------+----+-----------+
|  Alice|  25|   New York|
|    Bob|null|Los Angeles|
|Charlie|  30|       null|
+-------+----+-----------+

After dropping column "city":
+-------+----+
|   name| age|
+-------+----+
|  Alice|  25|
|    Bob|null|
|Charlie|  30|
+-------+----+

After dropping rows with null values:
+-----+---+--------+
| name|age|    city|
+-----+---+--------+
|Alice| 25|New York|
+-----+---+--------+
⚠️

Common Pitfalls

Common mistakes when using drop() include:

  • Trying to drop columns by passing a list instead of separate string arguments (use df.drop('col1', 'col2') or unpack with df.drop(*cols); df.drop(['col1', 'col2']) raises a TypeError).
  • Expecting df.drop() called without arguments to remove rows with nulls — it is a no-op that returns the DataFrame unchanged; use df.na.drop() or df.dropna() to drop null rows.
  • Not specifying the subset parameter when dropping rows with df.na.drop(), which then checks all columns and may drop more rows than intended.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('DropPitfalls').getOrCreate()

data = [(1, 2, 3), (4, None, 6), (7, 8, None)]
columns = ['A', 'B', 'C']
df = spark.createDataFrame(data, columns)

# Wrong: passing a list instead of separate args
# df_wrong = df.drop(['B', 'C'])  # raises a TypeError

# Right: pass column names as separate strings (or unpack a list: df.drop(*cols))
df_right = df.drop('B', 'C')

# Drop rows with nulls only in column 'B' (na.drop, not drop)
df_drop_subset = df.na.drop(subset=['B'])

print('Original DataFrame:')
df.show()
print('After dropping columns B and C:')
df_right.show()
print('After dropping rows with nulls in column B:')
df_drop_subset.show()

spark.stop()
Output
Original DataFrame:
+---+----+----+
|  A|   B|   C|
+---+----+----+
|  1|   2|   3|
|  4|null|   6|
|  7|   8|null|
+---+----+----+

After dropping columns B and C:
+---+
|  A|
+---+
|  1|
|  4|
|  7|
+---+

After dropping rows with nulls in column B:
+---+---+----+
|  A|  B|   C|
+---+---+----+
|  1|  2|   3|
|  7|  8|null|
+---+---+----+
📊

Quick Reference

Summary tips for using drop() in PySpark:

  • Use df.drop('col1', 'col2') to remove columns by name.
  • Use df.na.drop() (or df.dropna()) to remove rows containing any null value.
  • Use df.na.drop(how='all') to drop rows where all checked values are null.
  • Use df.na.drop(thresh=n) to keep only rows with at least n non-null values.
  • Use subset=['col1', 'col2'] to restrict the null check to specific columns when dropping rows.
✅

Key Takeaways

Use drop() with column names as separate strings to remove columns from a DataFrame.
Use df.na.drop() (or df.dropna()) — not df.drop() — to remove rows containing null values.
Specify the subset, how, and thresh parameters to control which rows with nulls get dropped.
Passing a list instead of separate column names to drop() raises a TypeError; unpack the list with df.drop(*cols) instead.
Always check your DataFrame after dropping to confirm the intended rows or columns were removed.