
How to Use agg in PySpark: Syntax and Examples

In PySpark, use the agg function on a DataFrame to perform multiple aggregate operations like sum, average, or count in one call. You pass a dictionary or expressions specifying the columns and aggregation functions to agg. This helps summarize data efficiently.

Syntax

The agg function is called on a PySpark DataFrame to apply one or more aggregate functions on columns. You can pass a dictionary where keys are column names and values are aggregate functions as strings or PySpark functions.

Example syntax:

  • df.agg({'column1': 'sum', 'column2': 'avg'}) - uses string names of functions
  • df.agg(F.sum('column1'), F.avg('column2')) - uses PySpark functions imported as F

This returns a new DataFrame with the aggregated results.

python
from pyspark.sql import functions as F

# Using dictionary with function names as strings
df.agg({'sales': 'sum', 'quantity': 'avg'})

# Using PySpark functions directly
df.agg(F.sum('sales'), F.avg('quantity'))

Example

This example shows how to create a simple DataFrame and use agg to calculate the total sales and average quantity.

python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master('local').appName('aggExample').getOrCreate()

# Sample data
data = [
    ('Alice', 10, 2),
    ('Bob', 20, 5),
    ('Alice', 15, 3),
    ('Bob', 5, 1)
]

# Create DataFrame with columns: name, sales, quantity
df = spark.createDataFrame(data, ['name', 'sales', 'quantity'])

# Aggregate total sales and average quantity
global_agg = df.agg(F.sum('sales').alias('total_sales'), F.avg('quantity').alias('avg_quantity'))

global_agg.show()
Output
+-----------+------------+
|total_sales|avg_quantity|
+-----------+------------+
|         50|        2.75|
+-----------+------------+

Common Pitfalls

Common mistakes when using agg include:

  • Passing column names without aggregation functions, which causes errors.
  • Using aggregation functions without aliasing, making output columns hard to read.
  • Confusing agg with groupBy. agg alone aggregates the whole DataFrame, while groupBy groups data before aggregation.
python
from pyspark.sql import functions as F

# Wrong: passing a bare column name without an aggregation function
# df.agg('sales')  # raises an error

# Right: wrap the column in an aggregation function and alias the result
df.agg(F.sum('sales').alias('total_sales'))

Quick Reference

Summary tips for using agg in PySpark:

  • Use agg to apply multiple aggregate functions at once.
  • Pass a dictionary of column-function pairs or multiple PySpark functions.
  • Alias your aggregated columns for clear output.
  • Remember agg aggregates the entire DataFrame unless combined with groupBy.

Key Takeaways

  • Use agg to perform multiple aggregations on a DataFrame in one call.
  • Pass aggregation functions as a dictionary or as PySpark function expressions.
  • Always alias aggregated columns for readable output.
  • agg without groupBy aggregates the whole DataFrame.
  • Avoid passing bare column names without aggregation functions to agg.