Apache Spark · How-To · Beginner · 3 min read

How to Select Columns in PySpark: Simple Guide

In PySpark, you can select columns from a DataFrame using the select() method with column names as strings or using col() from pyspark.sql.functions. This returns a new DataFrame with only the chosen columns.

Syntax

The basic syntax to select columns in PySpark is using the select() method on a DataFrame. You can pass column names as strings or use the col() function for more flexibility.

  • df.select('col1', 'col2'): Select columns by their names as strings.
  • df.select(col('col1'), col('col2')): Select columns using col() for expressions or aliasing.
```python
from pyspark.sql.functions import col

df.select('column1', 'column2')
df.select(col('column1'), col('column2'))
```

Example

This example shows how to create a simple DataFrame and select specific columns using select(). It demonstrates selecting columns by name and using col().

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('SelectColumnsExample').getOrCreate()

# Create sample data
data = [(1, 'Alice', 29), (2, 'Bob', 31), (3, 'Cathy', 25)]
columns = ['id', 'name', 'age']

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Select columns by name
selected_df1 = df.select('name', 'age')

# Select columns using col()
selected_df2 = df.select(col('id'), col('name'))

# Show results
selected_df1.show()
selected_df2.show()
```
Output

```
+-----+---+
| name|age|
+-----+---+
|Alice| 29|
|  Bob| 31|
|Cathy| 25|
+-----+---+

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  3|Cathy|
+---+-----+
```

Common Pitfalls

Common mistakes when selecting columns in PySpark include:

  • Mixing a list of names with other arguments, as in df.select(['col1'], 'col2'), which raises an error. (Passing a plain list on its own, df.select(['col1', 'col2']), is actually accepted.)
  • Using column names that do not exist in the DataFrame, leading to analysis errors.
  • Confusing select() with selectExpr(), which expects SQL expressions as strings.
```python
from pyspark.sql.functions import col

# Wrong: mixing a list with other arguments raises an error
# df.select(['name'], 'age')

# Right: pass columns as separate arguments (a plain list also works)
df.select('name', 'age')

# Or use col() for expressions
df.select(col('name'), col('age'))
```

Quick Reference

| Method | Description | Example |
|---|---|---|
| select() | Select columns by names or expressions | df.select('col1', 'col2') |
| selectExpr() | Select columns using SQL expressions | df.selectExpr('col1 as c1', 'col2 + 1') |
| col() | Refer to a column for expressions | df.select(col('col1'), col('col2')) |

Key Takeaways

  • Use df.select() with column names as separate arguments to select columns.
  • Use col() from pyspark.sql.functions for column expressions or aliasing.
  • Do not mix a list of column names with other arguments inside select(); that raises an error.
  • Check that column names exist in the DataFrame before selecting.
  • selectExpr() is for SQL expressions, not simple column selection.