0
0
Apache Sparkdata~10 mins

Window functions in Apache Spark - Step-by-Step Execution

Choose your learning style9 modes available
Concept Flow - Window functions
Start with DataFrame
Define Window Spec
Apply Window Function
Compute Result per Partition
Return DataFrame with New Column
End
Window functions process rows within partitions of data, computing results like ranks or sums without collapsing rows.
Execution Sample
Apache Spark
from pyspark.sql import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy('department').orderBy('salary')
df.withColumn('rank', rank().over(windowSpec)).show()
This code ranks employees by salary within each department.
Execution Table
StepActionInput DataWindow SpecFunction AppliedOutput ColumnResult
1Start with DataFrame[{'name':'Alice','department':'HR','salary':3000}, {'name':'Bob','department':'HR','salary':4000}, {'name':'Charlie','department':'IT','salary':3500}]partitionBy('department').orderBy('salary')rank()rankDataFrame ready for ranking
2Partition data by departmentSame as inputHR: [Alice, Bob], IT: [Charlie]rank()rankPartitions created
3Order each partition by salary ascendingHR: [Alice(3000), Bob(4000)], IT: [Charlie(3500)]Samerank()rankOrdered partitions
4Apply rank() over each partitionSameSamerank()rankHR: Alice=1, Bob=2; IT: Charlie=1
5Add rank column to DataFrameOriginal rowsSamerank()rank[{'name': 'Alice', 'department': 'HR', 'salary': 3000, 'rank': 1}, {'name': 'Bob', 'department': 'HR', 'salary': 4000, 'rank': 2}, {'name': 'Charlie', 'department': 'IT', 'salary': 3500, 'rank': 1}]
6Show final DataFrameSameSamerank()rankDisplayed with rank column
7EndFinal DataFrameSameNoneNoneExecution complete
💡 All rows processed with rank assigned per partition; execution ends.
Variable Tracker
VariableStartAfter Step 2After Step 3After Step 4After Step 5Final
dfOriginal DataFramePartitioned by departmentPartitions ordered by salaryRank computed per partitionRank column addedDataFrame with rank column
windowSpecNot definedDefined as partitionBy department, orderBy salarySameSameSameSame
rankNot definedDefined as rank() functionSameApplied over windowSpecSameSame
Key Moments - 3 Insights
Why does the rank restart for each department?
Because the windowSpec partitions data by 'department', the rank function resets for each partition as shown in execution_table step 4.
Does the window function reduce the number of rows?
No, window functions add columns but keep all original rows, as seen in execution_table step 5 where rank is added without dropping rows.
What happens if we don't specify orderBy in the windowSpec?
Without orderBy, functions like rank cannot assign meaningful order, so results may be incorrect or error occurs; ordering is essential as shown in step 3.
Visual Quiz - 3 Questions
Test your understanding
Look at the execution_table at step 4, what is the rank of Bob in the HR department?
A1
B2
C3
D0
💡 Hint
Check the 'Result' column at step 4 showing ranks per partition.
At which step is the rank column added to the DataFrame?
AStep 5
BStep 3
CStep 4
DStep 6
💡 Hint
Look for the action 'Add rank column to DataFrame' in the execution_table.
If we remove partitionBy from windowSpec, what changes in the output?
ARank restarts for each department
BDataFrame rows are reduced
CRank is computed over entire DataFrame without resetting
DError occurs immediately
💡 Hint
Partitioning controls where rank resets; without it, rank applies globally as per concept.
Concept Snapshot
Window functions compute values across rows related to the current row.
Use Window specification to define partitions and order.
Apply functions like rank(), row_number(), sum() over the window.
They add columns without reducing rows.
Useful for running totals, rankings, and moving averages.
Full Transcript
Window functions in Apache Spark let you perform calculations across sets of rows related to the current row. You start with a DataFrame, define a window specification that partitions data (like by department) and orders it (like by salary). Then you apply a window function such as rank() over this window. The function computes results per partition and adds a new column to the DataFrame without removing any rows. For example, ranking employees by salary within each department assigns ranks starting at 1 for each department. This process keeps all original data but enriches it with new insights. Key points include that partitioning controls where the function resets, ordering is necessary for ranking, and window functions do not reduce the number of rows.