String functions in Spark make it easy to transform and analyze text data, simplifying common tasks on words and sentences.
String functions in Apache Spark
Introduction
You want to clean messy text data like removing spaces or changing case.
You need to find or replace parts of text in a big dataset.
You want to split a sentence into words or join words together.
You want to check if a text contains a certain word or pattern.
You want to count characters or get parts of a string.
Syntax
Apache Spark
from pyspark.sql.functions import lower, upper, trim, length, substring, concat, instr, regexp_replace

# Example usage in select:
df.select(lower(df.column_name), upper(df.column_name), trim(df.column_name))
Use these functions inside Spark DataFrame transformations like select or withColumn.
They work on columns, not plain Python strings.
Examples
This changes all letters in the 'name' column to lowercase.
Apache Spark
from pyspark.sql.functions import lower

df.select(lower(df.name))
This removes spaces at the start and end of the 'address' text.
Apache Spark
from pyspark.sql.functions import trim

df.select(trim(df.address))
This gets the first 3 characters from the 'name' column.
Apache Spark
from pyspark.sql.functions import substring

df.select(substring(df.name, 1, 3))
This replaces the word 'bad' with 'good' in the 'comment' column.
Apache Spark
from pyspark.sql.functions import regexp_replace

df.select(regexp_replace(df.comment, 'bad', 'good'))
Sample Program
This program creates a small table with names and comments. It then uses many string functions to clean names, change case, measure comment length, get parts of comments, replace spaces with underscores, find the word 'Spark', and create a user ID by joining text.
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    lower, upper, trim, length, substring,
    concat, instr, regexp_replace, lit
)

spark = SparkSession.builder.appName('StringFunctionsExample').getOrCreate()

data = [
    (1, ' Alice ', 'Hello World!'),
    (2, 'Bob', 'Spark is fun'),
    (3, 'Charlie', ' Data Science ')
]
columns = ['id', 'name', 'comment']
df = spark.createDataFrame(data, columns)

# Use string functions to clean and analyze the text columns
result = df.select(
    trim(df.name).alias('trimmed_name'),
    lower(trim(df.name)).alias('lower_name'),
    upper(trim(df.name)).alias('upper_name'),
    length(df.comment).alias('comment_length'),
    substring(df.comment, 1, 5).alias('comment_start'),
    regexp_replace(df.comment, ' ', '_').alias('comment_no_spaces'),
    instr(df.comment, 'Spark').alias('spark_pos'),
    # lit() wraps the literal suffix; a bare string would be read as a column name
    concat(trim(df.name), lit('_user')).alias('user_id')
)
result.show()
Important Notes
String functions in Spark work on columns, so always use them inside DataFrame operations.
Remember to import functions from pyspark.sql.functions before using them.
Some functions like instr return 0 if the text is not found.
Summary
String functions help clean and analyze text data in Spark easily.
Use them inside DataFrame select or withColumn to transform text columns.
Common functions include lower, upper, trim, substring, regexp_replace, and instr.