Apache Spark · ~10 mins

Column expressions and functions in Apache Spark - Step-by-Step Execution

Concept Flow - Column expressions and functions
Start with DataFrame
Select or create Column
Apply Column Expression or Function
Evaluate Expression
Return new Column or DataFrame
Use result for further processing or show
This flow shows how you start with a DataFrame, pick or create columns, apply expressions or functions on them, and get new columns or results for further use.
Execution Sample
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10), (2, 20)], ['id', 'value'])

# Add a new column computed from an expression on the existing 'value' column
df2 = df.withColumn('value_plus_5', F.col('value') + 5)
df2.show()
This code creates a DataFrame, adds a new column by adding 5 to an existing column, and shows the result.
Execution Table
Step | Action | Expression Evaluated | Resulting Column Values
1 | Create DataFrame | N/A | [{'id':1, 'value':10}, {'id':2, 'value':20}]
2 | Select column 'value' | F.col('value') | [10, 20]
3 | Add 5 to 'value' | F.col('value') + 5 | [15, 25]
4 | Add new column 'value_plus_5' | withColumn('value_plus_5', F.col('value') + 5) | [{'id':1, 'value':10, 'value_plus_5':15}, {'id':2, 'value':20, 'value_plus_5':25}]
5 | Show DataFrame | df2.show() | Displays table with columns: id, value, value_plus_5
💡 All rows processed; new column added successfully.
Variable Tracker
Variable | Start | After Step 1 | After Step 4 | Final
df | None | [{'id':1, 'value':10}, {'id':2, 'value':20}] | [{'id':1, 'value':10}, {'id':2, 'value':20}] | [{'id':1, 'value':10}, {'id':2, 'value':20}]
df2 | None | None | [{'id':1, 'value':10, 'value_plus_5':15}, {'id':2, 'value':20, 'value_plus_5':25}] | [{'id':1, 'value':10, 'value_plus_5':15}, {'id':2, 'value':20, 'value_plus_5':25}]
Key Moments - 2 Insights
Why does the original DataFrame 'df' not change after adding a new column?
In Spark, DataFrames are immutable. The withColumn method returns a new DataFrame df2 with the added column, leaving df unchanged, as shown in steps 1 and 4 of the execution table.
What does F.col('value') + 5 actually do?
It creates a new column expression that adds 5 to each value in the 'value' column. This expression is evaluated when the DataFrame action (like show) is called, as seen in steps 3 and 5.
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what are the values of the new column 'value_plus_5' after step 4?
A. [15, 25]
B. [10, 20]
C. [5, 15]
D. [1, 2]
💡 Hint
Check the 'Resulting Column Values' in step 4 of the execution table.
At which step is the new column 'value_plus_5' actually added to the DataFrame?
A. Step 2
B. Step 3
C. Step 4
D. Step 5
💡 Hint
Look for the action describing 'Add new column' in the execution table.
If we changed the expression to F.col('value') * 2, what would be the 'value_plus_5' column values after step 4?
A. [12, 22]
B. [20, 40]
C. [15, 25]
D. [5, 10]
💡 Hint
Multiplying each 'value' by 2 doubles the original values; see how addition changed values in step 4.
Concept Snapshot
Column expressions in Spark allow you to create or transform columns using functions.
Use F.col('colname') to refer to columns.
Apply expressions like addition, multiplication, or built-in functions.
Use withColumn() to add or replace columns.
DataFrames are immutable; transformations return new DataFrames.
Full Transcript
This lesson shows how to use column expressions and functions in Apache Spark. We start with a DataFrame containing columns 'id' and 'value'. We select the 'value' column and create a new column by adding 5 to each value, using the expression F.col('value') + 5 inside the withColumn method. The original DataFrame remains unchanged because Spark DataFrames are immutable. The new DataFrame has an additional column 'value_plus_5' with values 15 and 25. This process is lazy: the expression is evaluated only when an action like show() is called. Understanding this helps you transform data efficiently in Spark.