
How to Use SQL Functions in PySpark: Simple Guide

In PySpark, you use SQL functions by importing them from pyspark.sql.functions and applying them to DataFrame columns. These functions let you filter, aggregate, and transform data efficiently through the select, withColumn, and agg methods.
📝

Syntax

To use SQL functions in PySpark, first import the functions module, then apply the functions to DataFrame columns through methods like select, withColumn, or agg.

  • import pyspark.sql.functions as F: Imports SQL functions with alias F.
  • df.select(F.function_name(df.column)): Applies a function to a column in select.
  • df.withColumn('new_col', F.function_name(df.column)): Creates or replaces a column with a function result.
  • df.groupBy('col').agg(F.function_name('col')): Aggregates data using functions.
python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 'apple'), (2, 'banana'), (3, 'carrot')],
    ['id', 'fruit']
)

# Example syntax usage
result = df.select(F.upper(df.fruit).alias('fruit_upper'))
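
The snippet above shows the select pattern. The withColumn pattern from the list works the same way; here is a minimal sketch, reusing the same toy DataFrame (the column name fruit_length is just illustrative):

python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 'apple'), (2, 'banana'), (3, 'carrot')],
    ['id', 'fruit']
)

# withColumn adds (or replaces) a column computed from a SQL function
df_with_len = df.withColumn('fruit_length', F.length(df.fruit))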
💻

Example

This example shows how to use SQL functions like upper, length, and concat to transform data in a PySpark DataFrame.

python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    (1, 'apple'),
    (2, 'banana'),
    (3, 'carrot')
]

columns = ['id', 'fruit']

df = spark.createDataFrame(data, columns)

# Use SQL functions
transformed_df = df.select(
    'id',
    F.upper('fruit').alias('fruit_upper'),
    F.length('fruit').alias('fruit_length'),
    F.concat(F.lit('Fruit: '), F.col('fruit')).alias('fruit_label')
)

transformed_df.show()
Output
+---+-----------+------------+-------------+
| id|fruit_upper|fruit_length|  fruit_label|
+---+-----------+------------+-------------+
|  1|      APPLE|           5| Fruit: apple|
|  2|     BANANA|           6|Fruit: banana|
|  3|     CARROT|           6|Fruit: carrot|
+---+-----------+------------+-------------+
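
If you prefer SQL syntax directly, the same functions can be invoked through expression strings with F.expr, which parses a SQL fragment into a Column. A minimal equivalent sketch of part of the example above:

python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'apple'), (2, 'banana')], ['id', 'fruit'])

# F.expr parses a SQL expression string into a Column object
expr_df = df.select(
    'id',
    F.expr('upper(fruit)').alias('fruit_upper'),
    F.expr('length(fruit)').alias('fruit_length')
)
expr_df.show()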
⚠️

Common Pitfalls

Common mistakes when using SQL functions in PySpark include:

  • Not importing pyspark.sql.functions before use.
  • Confusing literals with column names: bare strings passed to functions are treated as column names, so literal values must be wrapped in lit(), and col() builds a column reference by name.
  • Using Python string methods instead of PySpark SQL functions, which won't work on DataFrame columns.
  • Forgetting to alias transformed columns, leading to confusing column names.
python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data = [(1, 'apple')]
df = spark.createDataFrame(data, ['id', 'fruit'])

# Wrong: Using Python string method (will error)
# df.select(df.fruit.upper()).show()

# Right: Use SQL function
correct_df = df.select(F.upper(df.fruit).alias('fruit_upper'))
correct_df.show()
Output
+-----------+
|fruit_upper|
+-----------+
|      APPLE|
+-----------+
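
The literal-versus-column confusion from the list is also worth seeing concretely: bare strings passed to most functions are interpreted as column names, so literal values must be wrapped in F.lit(). A minimal sketch:

python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'apple')], ['id', 'fruit'])

# Wrong: 'Fruit: ' is treated as a column name and fails to resolve
# df.select(F.concat('Fruit: ', 'fruit')).show()

# Right: wrap the literal in F.lit and alias the result; without .alias()
# the generated column name would be the unwieldy 'concat(Fruit: , fruit)'
labeled = df.select(F.concat(F.lit('Fruit: '), F.col('fruit')).alias('fruit_label'))
labeled.show()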
📊

Quick Reference

Function  Description                               Example Usage
upper     Converts string to uppercase              F.upper(df.col)
lower     Converts string to lowercase              F.lower(df.col)
length    Returns length of string                  F.length(df.col)
concat    Concatenates multiple columns or strings  F.concat(F.col('col1'), F.lit('-'), F.col('col2'))
col       Returns a column object for expressions   F.col('column_name')
lit       Creates a literal value column            F.lit('text')
sum       Calculates sum for aggregation            df.groupBy('key').agg(F.sum('value'))
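
To round out the sum row above, here is a minimal sketch of an aggregation with groupBy and agg; the sales data and column names are purely illustrative:

python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

sales = spark.createDataFrame(
    [('apple', 3), ('apple', 2), ('banana', 5)],
    ['fruit', 'qty']
)

# Sum qty per fruit; the alias keeps the output column name readable
totals = sales.groupBy('fruit').agg(F.sum('qty').alias('total_qty'))
totals.show()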
✅

Key Takeaways

  • Always import SQL functions from pyspark.sql.functions before use.
  • Use SQL functions on DataFrame columns, not Python string methods.
  • Alias transformed columns for clear output names.
  • Use functions like upper, length, and concat to manipulate data easily.
  • Group and aggregate data using SQL functions with groupBy and agg.