Column expressions and functions let you transform, calculate, and filter the values in the columns of a Spark DataFrame.
Column expressions and functions in Apache Spark
Introduction
You want to add a new column based on existing data.
You need to filter rows using conditions on column values.
You want to calculate statistics like sum or average of a column.
You want to change text or numbers in a column.
You want to combine or split columns.
Syntax
Apache Spark
from pyspark.sql.functions import col, expr, sum as _sum, avg

# Using col to refer to a column
col('column_name')

# Using expr for expressions
expr('column_name + 1')

# Using functions like sum or avg
_sum('column_name')
avg('column_name')
col() returns a Column object that refers to a column by name.
expr() parses a SQL expression written as a string and returns it as a Column.
Examples
Adds 1 to each value in the 'age' column.
Apache Spark
from pyspark.sql.functions import col

df.select(col('age') + 1)
Calculates 10% of the salary for each row.
Apache Spark
from pyspark.sql.functions import expr

df.select(expr('salary * 0.1'))
Finds total salary per department.
Apache Spark
# Import sum under an alias so it does not shadow Python's built-in sum()
from pyspark.sql.functions import sum as _sum

df.groupBy('department').agg(_sum('salary'))
Sample Program
This program creates a small table (a DataFrame) of people with their ages and salaries. It adds two new columns: one with the age plus 5, and one with a bonus equal to 10% of the salary. It then calculates the average salary, shows the table, and prints the average.
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, sum as _sum, avg

spark = SparkSession.builder.appName('ColumnExpressions').getOrCreate()

# Create sample data
data = [
    (1, 'Alice', 30, 1000),
    (2, 'Bob', 35, 1500),
    (3, 'Charlie', 40, 2000),
    (4, 'Diana', 25, 1200)
]
columns = ['id', 'name', 'age', 'salary']
df = spark.createDataFrame(data, columns)

# Add a new column with age plus 5
new_df = df.withColumn('age_plus_5', col('age') + 5)

# Calculate 10% bonus on salary
new_df = new_df.withColumn('bonus', expr('salary * 0.1'))

# Calculate average salary
avg_salary = new_df.select(avg('salary')).collect()[0][0]

# Show the new dataframe
new_df.show()
print(f'Average salary: {avg_salary}')

spark.stop()
Important Notes
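Use col() to make column references explicit when building expressions, for example col('age') + 1.
expr() is useful when an expression is easier to write as a SQL string, for example 'salary * 0.1'.
Aggregate functions such as sum() and avg() reduce a column to a single value per group, or to one value for the whole DataFrame when no groupBy() is used.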
Summary
Column expressions let you work with data inside columns easily.
You can add, change, or calculate new columns using functions.
These tools help you prepare and analyze the data in a DataFrame.