
Column expressions and functions in Apache Spark

Introduction

Column expressions and functions let you transform, calculate, and filter the values in DataFrame columns without writing row-by-row loops.

They are useful in situations like these:

You want to add a new column based on existing data.
You need to filter rows using conditions on column values.
You want to calculate statistics like sum or average of a column.
You want to change text or numbers in a column.
You want to combine or split columns.
Syntax
Apache Spark
from pyspark.sql.functions import col, expr, sum as _sum, avg

# Using col to refer to a column
col('column_name')

# Using expr for expressions
expr('column_name + 1')

# Using functions like sum or avg
_sum('column_name')
avg('column_name')

col() helps you refer to a column by name.

expr() lets you write SQL-like expressions as strings.

Examples
Adds 1 to each value in the 'age' column.
Apache Spark
from pyspark.sql.functions import col

df.select(col('age') + 1)
Calculates 10% of the salary for each row.
Apache Spark
from pyspark.sql.functions import expr

df.select(expr('salary * 0.1'))
Finds total salary per department.
Apache Spark
from pyspark.sql.functions import sum as _sum

# Import under an alias so Python's built-in sum() is not shadowed
df.groupBy('department').agg(_sum('salary'))
Sample Program

This program creates a small table of people with their ages and salaries. It adds two new columns: one with the age plus 5, and one with a bonus equal to 10% of the salary. Then it calculates the average salary, shows the table, and prints the average.

Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, avg

spark = SparkSession.builder.appName('ColumnExpressions').getOrCreate()

# Create sample data
data = [
    (1, 'Alice', 30, 1000),
    (2, 'Bob', 35, 1500),
    (3, 'Charlie', 40, 2000),
    (4, 'Diana', 25, 1200)
]

columns = ['id', 'name', 'age', 'salary']

df = spark.createDataFrame(data, columns)

# Add a new column with age plus 5
new_df = df.withColumn('age_plus_5', col('age') + 5)

# Calculate 10% bonus on salary
new_df = new_df.withColumn('bonus', expr('salary * 0.1'))

# Calculate average salary
avg_salary = new_df.select(avg('salary')).collect()[0][0]

# Show the new dataframe
new_df.show()

print(f'Average salary: {avg_salary}')

spark.stop()
Important Notes

Use col() to refer to a column explicitly when you build an expression in Python code.

expr() is useful for complex expressions written as strings.

Aggregate functions like sum() and avg() reduce a column to a single value, either for the whole table or for each group.

Summary

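Column expressions describe calculations on column values without looping over rows yourself.

You can add new columns, change existing ones, or aggregate values with col(), expr(), and functions like sum() and avg().

These tools cover most of the work of cleaning and preparing data for analysis.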