
GroupBy and aggregations in Apache Spark - Step-by-Step Execution

Concept Flow - GroupBy and aggregations
Start with DataFrame
Choose column(s) to group by
Apply groupBy operation
Select aggregation function(s)
Compute aggregated results
Return new DataFrame with grouped and aggregated data
GroupBy splits the data into groups by column values; aggregation then computes summary statistics per group.
Execution Sample
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [("A", 10), ("B", 20), ("A", 30), ("B", 40)]
df = spark.createDataFrame(data, ["Category", "Value"])

# Group rows by Category, then sum Value within each group
grouped = df.groupBy("Category").sum("Value")
grouped.show()
This code groups rows by 'Category' and sums the 'Value' column for each group.
Execution Table
Step | Action | Input Data | GroupBy Column | Aggregation | Output DataFrame
1 | Create DataFrame | [('A',10),('B',20),('A',30),('B',40)] | - | - | [('A',10),('B',20),('A',30),('B',40)]
2 | Apply groupBy on 'Category' | [('A',10),('B',20),('A',30),('B',40)] | Category | - | Groups: {'A': [('A',10),('A',30)], 'B': [('B',20),('B',40)]}
3 | Aggregate sum on 'Value' | Groups: {'A': [...], 'B': [...]} | Category | sum(Value) | [('A',40),('B',60)]
4 | Show result | [('A',40),('B',60)] | - | - | Displayed as a table with Category and sum(Value) columns
💡 Aggregation complete: grouped sums computed for each category.
Variable Tracker
Variable | Start | After Step 1 | After Step 2 | After Step 3 | Final
df | None | [('A',10),('B',20),('A',30),('B',40)] | [('A',10),('B',20),('A',30),('B',40)] | [('A',10),('B',20),('A',30),('B',40)] | [('A',10),('B',20),('A',30),('B',40)]
grouped | None | None | Groups: {'A': [('A',10),('A',30)], 'B': [('B',20),('B',40)]} | [('A',40),('B',60)] | [('A',40),('B',60)]
Key Moments - 3 Insights
Why does the output DataFrame only have one row per category?
Because groupBy collects all rows with the same category, and aggregation (sum) then combines their values into one summary row per group, as shown in step 3 of the execution table.
What happens if we group by a column but don't apply aggregation?
The groupBy operation alone creates groups but computes no results; it returns a GroupedData object rather than a DataFrame. An aggregation such as sum() must be applied to get summarized output, as seen between steps 2 and 3.
Can we group by multiple columns?
Yes, grouping by multiple columns creates groups based on unique combinations of those columns. The flow is the same but groups are more specific.
Visual Quiz - 3 Questions
Test your understanding
Looking at step 3 of the execution table, what is the sum of 'Value' for category 'B'?
A. 60
B. 40
C. 20
D. 80
💡 Hint
Check the 'Output DataFrame' column in step 3 of the execution table.
At which step does the DataFrame get split into groups?
A. Step 1
B. Step 2
C. Step 3
D. Step 4
💡 Hint
Look at the 'Action' and 'GroupBy Column' columns in the execution table.
If we change the aggregation from sum to count, what changes in the output DataFrame?
A. Output DataFrame stays the same
B. Values show the maximum value per group
C. Values show the count of rows per group instead of the sum
D. DataFrame will have more columns
💡 Hint
Aggregation function changes the summary metric computed per group.
Concept Snapshot
GroupBy and aggregations in Spark:
- Use df.groupBy('col') to group data by column(s)
- Apply aggregation like sum(), count(), avg() on grouped data
- Result is a new DataFrame with one row per group
- Aggregations summarize data within each group
- Useful for summarizing large datasets by categories
Full Transcript
This visual execution shows how to use groupBy and aggregation in Apache Spark. First, a DataFrame is created with categories and values. Then, groupBy splits the data into groups by the 'Category' column. Next, sum aggregation adds the 'Value' numbers within each group. The final output DataFrame shows one row per category with the sum of values. Key points include that groupBy alone does not summarize data until aggregation is applied, and that multiple columns can be used to group data. The execution table traces each step clearly, and the variable tracker shows how the DataFrame and grouped data change. This helps beginners see how grouping and aggregation work step-by-step.