
GroupBy and aggregations in Apache Spark - Step-by-Step Execution

Concept Flow - GroupBy and aggregations
Start with DataFrame
Choose column(s) to group by
Apply groupBy operation
Select aggregation function(s)
Compute aggregated results
Return new DataFrame with grouped and aggregated data
GroupBy splits the data into groups by column values; aggregation then computes summary statistics per group.
Execution Sample
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [("A", 10), ("B", 20), ("A", 30), ("B", 40)]
df = spark.createDataFrame(data, ["Category", "Value"])

# Group rows by Category, then sum Value within each group
grouped = df.groupBy("Category").sum("Value")
grouped.show()
This code groups rows by 'Category' and sums the 'Value' column for each group.
Execution Table
Step | Action | Input Data | GroupBy Column | Aggregation | Output DataFrame
1 | Create DataFrame | [('A',10),('B',20),('A',30),('B',40)] | - | - | [('A',10),('B',20),('A',30),('B',40)]
2 | Apply groupBy on 'Category' | [('A',10),('B',20),('A',30),('B',40)] | Category | - | Groups: {'A': [('A',10),('A',30)], 'B': [('B',20),('B',40)]}
3 | Aggregate sum on 'Value' | Groups: {'A': [...], 'B': [...]} | Category | sum(Value) | [('A',40),('B',60)]
4 | Show result | [('A',40),('B',60)] | - | - | Displayed as a table with Category and sum(Value) columns
💡 Aggregation complete: grouped sums computed for each category.
Variable Tracker
Variable | Start | After Step 1 | After Step 2 | After Step 3 | Final
df | None | [('A',10),('B',20),('A',30),('B',40)] | [('A',10),('B',20),('A',30),('B',40)] | [('A',10),('B',20),('A',30),('B',40)] | [('A',10),('B',20),('A',30),('B',40)]
grouped | None | None | Groups: {'A': [('A',10),('A',30)], 'B': [('B',20),('B',40)]} | [('A',40),('B',60)] | [('A',40),('B',60)]
Key Moments - 3 Insights
Why does the output DataFrame only have one row per category?
Because groupBy collects all rows with the same category, and aggregation (sum) then combines their values into one summary row per group, as shown in step 3 of the execution table.
What happens if we group by a column but don't apply aggregation?
The groupBy operation alone creates groups but computes no results; it returns a GroupedData object rather than a DataFrame. An aggregation such as sum() must be applied to get summarized output, as seen between steps 2 and 3.
Can we group by multiple columns?
Yes, grouping by multiple columns creates groups based on unique combinations of those columns. The flow is the same but groups are more specific.
Visual Quiz - 3 Questions
Test your understanding
Looking at step 3 of the execution table, what is the sum of 'Value' for category 'B'?
A. 60
B. 40
C. 20
D. 80
💡 Hint
Check the 'Output DataFrame' column in step 3 of the execution table.
At which step does the DataFrame get split into groups?
A. Step 1
B. Step 2
C. Step 3
D. Step 4
💡 Hint
Look at the 'Action' and 'GroupBy Column' columns in the execution table.
If we change the aggregation from sum to count, what changes in the output DataFrame?
A. Output DataFrame stays the same
B. Values show the maximum value per group
C. Values show the count of rows per group instead of the sum
D. DataFrame will have more columns
💡 Hint
Aggregation function changes the summary metric computed per group.
Concept Snapshot
GroupBy and aggregations in Spark:
- Use df.groupBy('col') to group data by column(s)
- Apply aggregation like sum(), count(), avg() on grouped data
- Result is a new DataFrame with one row per group
- Aggregations summarize data within each group
- Useful for summarizing large datasets by categories
Full Transcript
This visual execution shows how to use groupBy and aggregation in Apache Spark. First, a DataFrame is created with categories and values. Then, groupBy splits the data into groups by the 'Category' column. Next, sum aggregation adds the 'Value' numbers within each group. The final output DataFrame shows one row per category with the sum of values. Key points include that groupBy alone does not summarize data until aggregation is applied, and that multiple columns can be used to group data. The execution table traces each step clearly, and the variable tracker shows how the DataFrame and grouped data change. This helps beginners see how grouping and aggregation work step-by-step.