How to Select Columns in PySpark: Simple Guide
In PySpark, you can select columns from a DataFrame using the
select() method, passing column names as strings or Column objects built with col() from pyspark.sql.functions. select() returns a new DataFrame containing only the chosen columns.
Syntax
The basic syntax to select columns in PySpark is using the select() method on a DataFrame. You can pass column names as strings or use the col() function for more flexibility.
df.select('col1', 'col2'): Select columns by their names as strings.
df.select(col('col1'), col('col2')): Select columns using col() for expressions or aliasing.
python
from pyspark.sql.functions import col

df.select('column1', 'column2')
df.select(col('column1'), col('column2'))
Example
This example shows how to create a simple DataFrame and select specific columns using select(). It demonstrates selecting columns by name and using col().
python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('SelectColumnsExample').getOrCreate()

# Create sample data
data = [(1, 'Alice', 29), (2, 'Bob', 31), (3, 'Cathy', 25)]
columns = ['id', 'name', 'age']

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Select columns by name
selected_df1 = df.select('name', 'age')

# Select columns using col()
selected_df2 = df.select(col('id'), col('name'))

# Show results
selected_df1.show()
selected_df2.show()
Output
+-----+---+
| name|age|
+-----+---+
|Alice| 29|
| Bob| 31|
|Cathy| 25|
+-----+---+
+---+-----+
| id| name|
+---+-----+
| 1|Alice|
| 2| Bob|
| 3|Cathy|
+---+-----+
Common Pitfalls
Common mistakes when selecting columns in PySpark include:
- Mixing a list with separate arguments, like df.select('col1', ['col2', 'col3']). select() accepts either a single list (df.select(['col1', 'col2'])) or separate arguments, but not a combination of the two.
- Using column names that do not exist in the DataFrame, which raises an AnalysisException.
- Confusing select() with selectExpr(), which expects SQL expressions as strings.
python
from pyspark.sql.functions import col

# Works: pass columns as separate arguments
# df.select('name', 'age')

# Also works: pass a single list as the only argument
# df.select(['name', 'age'])

# Wrong: mixing a list with separate arguments raises a TypeError
# df.select('id', ['name', 'age'])

# col() works the same way for expressions
# df.select(col('name'), col('age'))
Quick Reference
| Method | Description | Example |
|---|---|---|
| select() | Select columns by names or expressions | df.select('col1', 'col2') |
| selectExpr() | Select columns using SQL expressions | df.selectExpr('col1 as c1', 'col2 + 1') |
| col() | Function to refer to a column for expressions | df.select(col('col1'), col('col2')) |
Key Takeaways
Use df.select() with column names as separate arguments to select columns.
Use col() from pyspark.sql.functions for column expressions or aliasing.
select() accepts either separate column names or a single list of names, but not a mix of the two in one call.
Check column names exist in the DataFrame before selecting.
selectExpr() is for SQL expressions, not simple column selection.