Apache Spark · How-To · Beginner · 3 min read

How to Select Columns in PySpark: Simple Guide

In PySpark, you can select columns from a DataFrame using the select() method with column names as strings or using col() from pyspark.sql.functions. This returns a new DataFrame with only the chosen columns.

Syntax

The basic syntax to select columns in PySpark is using the select() method on a DataFrame. You can pass column names as strings or use the col() function for more flexibility.

  • df.select('col1', 'col2'): Select columns by their names as strings.
  • df.select(col('col1'), col('col2')): Select columns using col() for expressions or aliasing.
```python
from pyspark.sql.functions import col

df.select('column1', 'column2')
df.select(col('column1'), col('column2'))
```

Example

This example shows how to create a simple DataFrame and select specific columns using select(). It demonstrates selecting columns by name and using col().

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('SelectColumnsExample').getOrCreate()

# Create sample data
data = [(1, 'Alice', 29), (2, 'Bob', 31), (3, 'Cathy', 25)]
columns = ['id', 'name', 'age']

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Select columns by name
selected_df1 = df.select('name', 'age')

# Select columns using col()
selected_df2 = df.select(col('id'), col('name'))

# Show results
selected_df1.show()
selected_df2.show()
```
Output

```
+-----+---+
| name|age|
+-----+---+
|Alice| 29|
|  Bob| 31|
|Cathy| 25|
+-----+---+

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  3|Cathy|
+---+-----+
```

Common Pitfalls

Common mistakes when selecting columns in PySpark include:

  • Mixing a list of names with other arguments, as in df.select(['col1'], 'col2'), which raises an error. (Passing a plain list on its own, df.select(['col1', 'col2']), is actually accepted.)
  • Using column names that do not exist in the DataFrame, leading to analysis errors.
  • Confusing select() with selectExpr(), which expects SQL expressions as strings.
```python
from pyspark.sql.functions import col

# Wrong: mixing a list with other arguments raises an error
# df.select(['name'], 'age')

# Right: pass columns as separate arguments (a plain list also works)
df.select('name', 'age')

# Or use col() for expressions
df.select(col('name'), col('age'))
```

Quick Reference

| Method | Description | Example |
|---|---|---|
| select() | Select columns by names or expressions | df.select('col1', 'col2') |
| selectExpr() | Select columns using SQL expressions | df.selectExpr('col1 as c1', 'col2 + 1') |
| col() | Refer to a column for expressions | df.select(col('col1'), col('col2')) |

Key Takeaways

  • Use df.select() with column names as separate arguments to select columns.
  • Use col() from pyspark.sql.functions for column expressions or aliasing.
  • Do not mix a list of column names with other arguments inside select(); that raises an error.
  • Check that column names exist in the DataFrame before selecting.
  • selectExpr() is for SQL expressions, not simple column selection.