How to Use drop() in PySpark DataFrames
In PySpark, use the DataFrame's drop() method to remove columns, and dropna() (an alias for df.na.drop()) to remove rows that contain null values. To drop columns, pass column names as strings to drop(). To drop rows with nulls, call dropna(), optionally with the how, thresh, and subset parameters.

Syntax
PySpark DataFrames expose two related methods:

- Drop columns: df.drop('col1', 'col2') removes the specified columns.
- Drop rows with nulls: df.dropna() (equivalent to df.na.drop()) removes rows containing null values.

Parameters for dropping rows include how ('any' or 'all'), thresh (the minimum number of non-null values a row must have to be kept), and subset (the columns to check for nulls).
```python
df.drop(*cols)
df.dropna(how='any', thresh=None, subset=None)
```
Example
This example shows how to drop a column and how to drop rows with null values from a PySpark DataFrame.
```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.master('local[*]').appName('DropExample').getOrCreate()

# Create sample data
data = [
    Row(name='Alice', age=25, city='New York'),
    Row(name='Bob', age=None, city='Los Angeles'),
    Row(name='Charlie', age=30, city=None),
]

# Create DataFrame
df = spark.createDataFrame(data)

# Drop the 'city' column
df_drop_col = df.drop('city')

# Drop rows with any null values
df_drop_rows = df.dropna()

# Show original DataFrame
print('Original DataFrame:')
df.show()

# Show DataFrame after dropping column
print('After dropping column "city":')
df_drop_col.show()

# Show DataFrame after dropping rows with nulls
print('After dropping rows with null values:')
df_drop_rows.show()

spark.stop()
```
Output
Original DataFrame:
+-------+----+-----------+
| name| age| city|
+-------+----+-----------+
| Alice| 25| New York|
| Bob|null|Los Angeles|
|Charlie| 30| null|
+-------+----+-----------+
After dropping column "city":
+-------+----+
| name| age|
+-------+----+
| Alice| 25|
| Bob|null|
|Charlie| 30|
+-------+----+
After dropping rows with null values:
+-----+---+--------+
| name|age| city|
+-----+---+--------+
|Alice| 25|New York|
+-----+---+--------+
Common Pitfalls
Common mistakes when using drop() include:

- Passing a list of column names instead of separate string arguments: use df.drop('col1', 'col2'), not df.drop(['col1', 'col2']), which raises a TypeError.
- Expecting drop() to remove rows: drop() only removes columns. To remove rows with nulls, use dropna() (or df.na.drop()); calling drop() without arguments simply returns the DataFrame unchanged.
- Not specifying the subset parameter when calling dropna(), which checks all columns and may drop more rows than intended.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('DropPitfalls').getOrCreate()

data = [(1, 2, 3), (4, None, 6), (7, 8, None)]
columns = ['A', 'B', 'C']
df = spark.createDataFrame(data, columns)

# Wrong: passing a list instead of separate args
# df_wrong = df.drop(['B', 'C'])  # This raises a TypeError

# Right: pass columns as separate strings
df_right = df.drop('B', 'C')

# Drop rows with nulls only in column 'B'
df_drop_subset = df.dropna(subset=['B'])

print('Original DataFrame:')
df.show()

print('After dropping columns B and C:')
df_right.show()

print('After dropping rows with nulls in column B:')
df_drop_subset.show()

spark.stop()
```
Output
Original DataFrame:
+---+----+----+
| A| B| C|
+---+----+----+
| 1| 2| 3|
| 4|null| 6|
| 7| 8|null|
+---+----+----+
After dropping columns B and C:
+---+
| A|
+---+
| 1|
| 4|
| 7|
+---+
After dropping rows with nulls in column B:
+---+---+----+
| A| B| C|
+---+---+----+
| 1| 2| 3|
| 7| 8|null|
+---+---+----+
Quick Reference
Summary tips for using drop() and dropna() in PySpark:

- Use df.drop('col1', 'col2') to remove columns by name.
- Use df.dropna() to remove rows containing any null values.
- Use df.dropna(how='all') to drop rows where all values are null.
- Use df.dropna(thresh=n) to keep only rows with at least n non-null values.
- Use df.dropna(subset=['col1', 'col2']) to restrict the null check to specific columns.
Key Takeaways
Use drop() with column names as separate strings to remove columns from a DataFrame.
Use dropna() (equivalent to df.na.drop()) to remove rows containing null values; drop() by itself never removes rows.
Specify the subset and how parameters of dropna() to control which rows with nulls get dropped.
Passing a list instead of separate column names to drop() raises a TypeError.
Always check your DataFrame after drop() or dropna() to confirm the intended rows or columns were removed.