How to Use agg in PySpark: Syntax and Examples
In PySpark, call the agg function on a DataFrame to perform multiple aggregate operations, such as sum, average, or count, in one call. You pass a dictionary or column expressions specifying the columns and aggregation functions to agg. This helps summarize data efficiently.
Syntax
The agg function is called on a PySpark DataFrame to apply one or more aggregate functions on columns. You can pass a dictionary where keys are column names and values are aggregate functions as strings or PySpark functions.
Example syntax:
- `df.agg({'column1': 'sum', 'column2': 'avg'})` uses string names of functions
- `df.agg(F.sum('column1'), F.avg('column2'))` uses PySpark functions imported as `F`
This returns a new DataFrame with the aggregated results.
```python
from pyspark.sql import functions as F

# Using a dictionary with function names as strings
df.agg({'sales': 'sum', 'quantity': 'avg'})

# Using PySpark functions directly
df.agg(F.sum('sales'), F.avg('quantity'))
```
Example
This example shows how to create a simple DataFrame and use agg to calculate the total sales and average quantity.
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master('local').appName('aggExample').getOrCreate()

# Sample data
data = [
    ('Alice', 10, 2),
    ('Bob', 20, 5),
    ('Alice', 15, 3),
    ('Bob', 5, 1)
]

# Create DataFrame with columns: name, sales, quantity
df = spark.createDataFrame(data, ['name', 'sales', 'quantity'])

# Aggregate total sales and average quantity
global_agg = df.agg(
    F.sum('sales').alias('total_sales'),
    F.avg('quantity').alias('avg_quantity')
)
global_agg.show()
```
Output
+-----------+------------+
|total_sales|avg_quantity|
+-----------+------------+
| 50| 2.75|
+-----------+------------+
Common Pitfalls
Common mistakes when using agg include:
- Passing column names without aggregation functions, which causes errors.
- Using aggregation functions without aliasing, making output columns hard to read.
- Confusing `agg` with `groupBy`: `agg` alone aggregates the whole DataFrame, while `groupBy` groups data before aggregation.
```python
from pyspark.sql import functions as F

# Wrong: passing a column name without an aggregation function
# df.agg('sales')  # This will raise an error

# Right: wrap the column in an aggregation function
# df.agg(F.sum('sales').alias('total_sales'))
```
Quick Reference
Summary tips for using agg in PySpark:
- Use `agg` to apply multiple aggregate functions at once.
- Pass a dictionary of column-function pairs or multiple PySpark functions.
- Alias your aggregated columns for clear output.
- Remember that `agg` aggregates the entire DataFrame unless combined with `groupBy`.
Key Takeaways
- Use `agg` to perform multiple aggregations on a DataFrame in one call.
- Pass aggregation functions as a dictionary or PySpark function expressions.
- Always alias aggregated columns for readable output.
- `agg` without `groupBy` aggregates the whole DataFrame.
- Avoid passing column names alone, without aggregation functions, to `agg`.