
GroupBy and aggregations in Apache Spark

Introduction

GroupBy helps you organize data into groups based on one or more columns. Aggregations let you calculate summary values like sums or averages for each group.

You want to find total sales per product category.
You need to calculate average temperature per city.
You want to count how many users belong to each country.
You want to find the maximum score per student in a test.
Syntax
df.groupBy("column_name").agg(functions)

Use groupBy to split data into groups by column(s).

Use agg to apply aggregation functions like sum, avg, count.

Examples
Groups data by 'category' and sums the 'sales' in each group.
df.groupBy("category").sum("sales")
Groups data by 'city' and calculates average temperature per city.
df.groupBy("city").agg({'temperature': 'avg'})
Groups data by 'country' and counts rows in each group.
df.groupBy("country").count()
Sample Program

This program creates a Spark DataFrame with sales data by category. It then groups the data by category and calculates the sum, average, and maximum sales for each group. Finally, it prints the results.

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg, max

spark = SparkSession.builder.appName("GroupByExample").getOrCreate()

# Sample data
data = [
    ("Electronics", 100),
    ("Electronics", 150),
    ("Clothing", 200),
    ("Clothing", 50),
    ("Groceries", 300)
]

# Create DataFrame
columns = ["category", "sales"]
df = spark.createDataFrame(data, columns)

# Group by category and sum sales
sum_sales = df.groupBy("category").sum("sales")

# Group by category and average sales
avg_sales = df.groupBy("category").agg(avg("sales").alias("avg_sales"))

# Group by category and max sales
max_sales = df.groupBy("category").agg(max("sales").alias("max_sales"))

# Show results
print("Sum of sales by category:")
sum_sales.show()

print("Average sales by category:")
avg_sales.show()

print("Max sales by category:")
max_sales.show()

spark.stop()
Important Notes

You can group by multiple columns by passing several column names, like groupBy('col1', 'col2'), or a list, like groupBy(['col1', 'col2']).

Aggregation functions like sum, avg, max, min, and count are available in pyspark.sql.functions. Note that importing sum and max directly shadows Python's built-ins of the same name, which is why many projects use `from pyspark.sql import functions as F` and call F.sum, F.max, and so on.

Always give meaningful names to aggregated columns using alias for clearer output.

Summary

GroupBy splits data into groups based on column values.

Aggregations calculate summary statistics for each group.

Use Spark functions like sum, avg, max inside agg for flexible summaries.