
How to Use agg in PySpark: Syntax and Examples

In PySpark, use the agg function on a DataFrame to perform multiple aggregate operations like sum, average, or count in one call. You pass a dictionary or expressions specifying the columns and aggregation functions to agg. This helps summarize data efficiently.

Syntax

The agg function is called on a PySpark DataFrame to apply one or more aggregate functions on columns. You can pass a dictionary where keys are column names and values are aggregate functions as strings or PySpark functions.

Example syntax:

  • df.agg({'column1': 'sum', 'column2': 'avg'}) - uses string names of functions
  • df.agg(F.sum('column1'), F.avg('column2')) - uses PySpark functions imported as F

This returns a new DataFrame with the aggregated results.

python
from pyspark.sql import functions as F

# Using dictionary with function names as strings
df.agg({'sales': 'sum', 'quantity': 'avg'})

# Using PySpark functions directly
df.agg(F.sum('sales'), F.avg('quantity'))

Example

This example shows how to create a simple DataFrame and use agg to calculate the total sales and average quantity.

python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master('local').appName('aggExample').getOrCreate()

# Sample data
data = [
    ('Alice', 10, 2),
    ('Bob', 20, 5),
    ('Alice', 15, 3),
    ('Bob', 5, 1)
]

# Create DataFrame with columns: name, sales, quantity
df = spark.createDataFrame(data, ['name', 'sales', 'quantity'])

# Aggregate total sales and average quantity
global_agg = df.agg(F.sum('sales').alias('total_sales'), F.avg('quantity').alias('avg_quantity'))

global_agg.show()
Output
+-----------+------------+
|total_sales|avg_quantity|
+-----------+------------+
|         50|        2.75|
+-----------+------------+

Common Pitfalls

Common mistakes when using agg include:

  • Passing column names without aggregation functions, which causes errors.
  • Using aggregation functions without aliasing, making output columns hard to read.
  • Confusing agg with groupBy. agg alone aggregates the whole DataFrame, while groupBy groups data before aggregation.
python
from pyspark.sql import functions as F

# Wrong: passing a bare column name without an aggregation function
# df.agg('sales')  # raises an error

# Right: wrap the column in an aggregation function and alias the result
df.agg(F.sum('sales').alias('total_sales'))

Quick Reference

Summary tips for using agg in PySpark:

  • Use agg to apply multiple aggregate functions at once.
  • Pass a dictionary of column-function pairs or multiple PySpark functions.
  • Alias your aggregated columns for clear output.
  • Remember agg aggregates the entire DataFrame unless combined with groupBy.

Key Takeaways

  • Use agg to perform multiple aggregations on a DataFrame in one call.
  • Pass aggregation functions as a dictionary or as PySpark function expressions.
  • Always alias aggregated columns for readable output.
  • agg without groupBy aggregates the whole DataFrame.
  • Avoid passing bare column names without aggregation functions to agg.