How to Use SQL Functions in PySpark: Simple Guide
In PySpark, you use SQL functions by importing them from pyspark.sql.functions and applying them to DataFrame columns. These functions let you filter, aggregate, and transform data efficiently through the select, withColumn, and agg methods.
Syntax
To use SQL functions in PySpark, first import the functions module. Then apply functions on DataFrame columns using methods like select, withColumn, or agg.
- import pyspark.sql.functions as F: Imports SQL functions under the alias F.
- df.select(F.function_name(df.column)): Applies a function to a column in select.
- df.withColumn('new_col', F.function_name(df.column)): Creates or replaces a column with a function result.
- df.groupBy('col').agg(F.function_name('col')): Aggregates data using functions.
```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 'apple'), (2, 'banana'), (3, 'carrot')],
    ['id', 'fruit']
)

# Example syntax usage
result = df.select(F.upper(df.fruit).alias('fruit_upper'))
```
Example
This example shows how to use SQL functions like upper, length, and concat to transform data in a PySpark DataFrame.
```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    (1, 'apple'),
    (2, 'banana'),
    (3, 'carrot')
]
columns = ['id', 'fruit']
df = spark.createDataFrame(data, columns)

# Use SQL functions
transformed_df = df.select(
    'id',
    F.upper('fruit').alias('fruit_upper'),
    F.length('fruit').alias('fruit_length'),
    F.concat(F.lit('Fruit: '), F.col('fruit')).alias('fruit_label')
)
transformed_df.show()
```
Output
+---+-----------+------------+-------------+
| id|fruit_upper|fruit_length|  fruit_label|
+---+-----------+------------+-------------+
|  1|      APPLE|           5| Fruit: apple|
|  2|     BANANA|           6|Fruit: banana|
|  3|     CARROT|           6|Fruit: carrot|
+---+-----------+------------+-------------+
Common Pitfalls
Common mistakes when using SQL functions in PySpark include:
- Not importing pyspark.sql.functions before use.
- Using Python string methods (e.g., df.fruit.upper()) instead of PySpark SQL functions; Python string methods don't exist on Column objects and will raise an error.
- Passing a plain string where a Column expression is needed, instead of wrapping it with col() or lit().
- Forgetting to alias transformed columns, leading to confusing auto-generated column names like upper(fruit).
```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data = [(1, 'apple')]
df = spark.createDataFrame(data, ['id', 'fruit'])

# Wrong: Python string methods don't exist on Column objects (will error)
# df.select(df.fruit.upper()).show()

# Right: use the SQL function
correct_df = df.select(F.upper(df.fruit).alias('fruit_upper'))
correct_df.show()
```
Output
+-----------+
|fruit_upper|
+-----------+
|      APPLE|
+-----------+
Quick Reference
| Function | Description | Example Usage |
|---|---|---|
| upper | Converts string to uppercase | F.upper(df.col) |
| lower | Converts string to lowercase | F.lower(df.col) |
| length | Returns length of string | F.length(df.col) |
| concat | Concatenates multiple columns or strings | F.concat(F.col('a'), F.lit('-'), F.col('b')) |
| col | Returns a column object for expressions | F.col('column_name') |
| lit | Creates a literal value column | F.lit('text') |
| sum | Calculates sum for aggregation | df.groupBy('key').agg(F.sum('value')) |
Key Takeaways
- Always import SQL functions from pyspark.sql.functions before use.
- Use SQL functions on DataFrame columns, not Python string methods.
- Alias transformed columns for clear output names.
- Use functions like upper, length, and concat to manipulate data easily.
- Group and aggregate data using SQL functions with groupBy and agg.