
String functions in Apache Spark

Introduction

String functions let you transform and analyze text data in Spark, making it simple to work with words and sentences.

You want to clean messy text data like removing spaces or changing case.
You need to find or replace parts of text in a big dataset.
You want to split a sentence into words or join words together.
You want to check if a text contains a certain word or pattern.
You want to count characters or get parts of a string.
Syntax
Apache Spark
from pyspark.sql.functions import lower, upper, trim, length, substring, concat, instr, regexp_replace

# Example usage in select:
df.select(lower(df.column_name), upper(df.column_name), trim(df.column_name))
Use these functions inside Spark DataFrame transformations such as select or withColumn.
They operate on columns, not on plain Python strings.
Examples
This changes all letters in the 'name' column to lowercase.
Apache Spark
from pyspark.sql.functions import lower

df.select(lower(df.name))
This removes spaces at the start and end of the 'address' text.
Apache Spark
from pyspark.sql.functions import trim

df.select(trim(df.address))
This gets the first 3 characters from the 'name' column (substring positions start at 1, not 0).
Apache Spark
from pyspark.sql.functions import substring

df.select(substring(df.name, 1, 3))
This replaces the word 'bad' with 'good' in the 'comment' column. Note that the search pattern is treated as a regular expression.
Apache Spark
from pyspark.sql.functions import regexp_replace

df.select(regexp_replace(df.comment, 'bad', 'good'))
Sample Program

This program creates a small table with names and comments. It then uses many string functions to clean names, change case, measure comment length, get parts of comments, replace spaces with underscores, find the word 'Spark', and create a user ID by joining text.

Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import lower, upper, trim, length, substring, concat, instr, regexp_replace, lit

spark = SparkSession.builder.appName('StringFunctionsExample').getOrCreate()

data = [
    (1, ' Alice ', 'Hello World!'),
    (2, 'Bob', 'Spark is fun'),
    (3, 'Charlie', '  Data Science  ')
]

columns = ['id', 'name', 'comment']
df = spark.createDataFrame(data, columns)

# Use string functions
result = df.select(
    trim(df.name).alias('trimmed_name'),
    lower(trim(df.name)).alias('lower_name'),
    upper(trim(df.name)).alias('upper_name'),
    length(df.comment).alias('comment_length'),
    substring(df.comment, 1, 5).alias('comment_start'),
    regexp_replace(df.comment, ' ', '_').alias('comment_no_spaces'),
    instr(df.comment, 'Spark').alias('spark_pos'),
    # Wrap the literal in lit() so concat receives a Column, not a column name
    concat(trim(df.name), lit('_user')).alias('user_id')
)

result.show()
Important Notes

String functions in Spark work on columns, so always use them inside DataFrame operations.

Remember to import functions from pyspark.sql.functions before using them.

Some functions, like instr, return the 1-based position of the first match and 0 if the text is not found.

Summary

String functions help clean and analyze text data in Spark easily.

Use them inside DataFrame select or withColumn to transform text columns.

Common functions include lower, upper, trim, substring, regexp_replace, and instr.