Apache Spark · ~10 mins

Column expressions and functions in Apache Spark - Step-by-Step Execution

Concept Flow - Column expressions and functions
Start with DataFrame
Select or create Column
Apply Column Expression or Function
Evaluate Expression
Return new Column or DataFrame
Use result for further processing or show
This flow shows how you start with a DataFrame, pick or create columns, apply expressions or functions on them, and get new columns or results for further use.
Execution Sample
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10), (2, 20)], ['id', 'value'])

# Add a new column computed from an expression on the existing 'value' column
df2 = df.withColumn('value_plus_5', F.col('value') + 5)
df2.show()
This code creates a DataFrame, adds a new column by adding 5 to an existing column, and shows the result.
Execution Table
Step | Action | Expression Evaluated | Resulting Column Values
1 | Create DataFrame | N/A | [{'id':1, 'value':10}, {'id':2, 'value':20}]
2 | Select column 'value' | F.col('value') | [10, 20]
3 | Add 5 to 'value' | F.col('value') + 5 | [15, 25]
4 | Add new column 'value_plus_5' | withColumn('value_plus_5', F.col('value') + 5) | [{'id':1, 'value':10, 'value_plus_5':15}, {'id':2, 'value':20, 'value_plus_5':25}]
5 | Show DataFrame | df2.show() | Displays table with columns: id, value, value_plus_5
💡 All rows processed; new column added successfully.
Variable Tracker
Variable | Start | After Step 1 | After Step 4 | Final
df | None | [{'id':1, 'value':10}, {'id':2, 'value':20}] | [{'id':1, 'value':10}, {'id':2, 'value':20}] | [{'id':1, 'value':10}, {'id':2, 'value':20}]
df2 | None | None | [{'id':1, 'value':10, 'value_plus_5':15}, {'id':2, 'value':20, 'value_plus_5':25}] | [{'id':1, 'value':10, 'value_plus_5':15}, {'id':2, 'value':20, 'value_plus_5':25}]
Key Moments - 2 Insights
Why does the original DataFrame 'df' not change after adding a new column?
In Spark, DataFrames are immutable. The withColumn method returns a new DataFrame df2 with the added column, leaving df unchanged, as shown in steps 1 and 4 of the execution table.
What does F.col('value') + 5 actually do?
It creates a new column expression that adds 5 to each value in the 'value' column. This expression is evaluated when the DataFrame action (like show) is called, as seen in steps 3 and 5.
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what are the values of the new column 'value_plus_5' after step 4?
A. [15, 25]
B. [10, 20]
C. [5, 15]
D. [1, 2]
💡 Hint
Check the 'Resulting Column Values' in step 4 of the execution table.
At which step is the new column 'value_plus_5' actually added to the DataFrame?
A. Step 2
B. Step 3
C. Step 4
D. Step 5
💡 Hint
Look for the action describing 'Add new column' in the execution table.
If we changed the expression to F.col('value') * 2, what would be the 'value_plus_5' column values after step 4?
A. [12, 22]
B. [20, 40]
C. [15, 25]
D. [5, 10]
💡 Hint
Multiplying each 'value' by 2 doubles the original values; see how addition changed values in step 4.
Concept Snapshot
Column expressions in Spark allow you to create or transform columns using functions.
Use F.col('colname') to refer to columns.
Apply expressions like addition, multiplication, or built-in functions.
Use withColumn() to add or replace columns.
DataFrames are immutable; transformations return new DataFrames.
Full Transcript
This lesson shows how to use column expressions and functions in Apache Spark. We start with a DataFrame containing columns 'id' and 'value'. We select the 'value' column and create a new column by adding 5 to each value, using the expression F.col('value') + 5 inside the withColumn method. The original DataFrame remains unchanged because Spark DataFrames are immutable. The new DataFrame has an additional column 'value_plus_5' with values 15 and 25. This process is lazy: the expression is evaluated only when an action like show() is called. Understanding this helps you transform data efficiently in Spark.