Concept Flow - Window functions

Start with DataFrame

↓

Define Window Spec

↓

Apply Window Function

↓

Compute Result per Partition

↓

Return DataFrame with New Column

↓

End

Window functions process rows within partitions of data, computing results like ranks or sums without collapsing rows.
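To make "without collapsing rows" concrete, here is a minimal plain-Python sketch (not Spark code) that contrasts a group-by style aggregate with a window-style sum. The helper names `group_sum` and `window_sum` and the `dept_total` column are made up for illustration; the sample rows are the ones used in this walkthrough.

```python
from collections import defaultdict

# Sample rows used throughout this walkthrough.
rows = [
    {"name": "Alice", "department": "HR", "salary": 3000},
    {"name": "Bob", "department": "HR", "salary": 4000},
    {"name": "Charlie", "department": "IT", "salary": 3500},
]


def group_sum(rows, key, value):
    """Group-by aggregate: one output row per group, so rows collapse."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return [{key: k, "total": v} for k, v in totals.items()]


def window_sum(rows, key, value):
    """Window-style sum: every input row is kept and gains a new column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return [{**row, "dept_total": totals[row[key]]} for row in rows]


grouped = group_sum(rows, "department", "salary")    # one row per department
windowed = window_sum(rows, "department", "salary")  # one row per employee
print(len(grouped), len(windowed))
```

The aggregate returns one row per department, while the window-style version returns one row per employee, each carrying its department total.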

Execution Sample

Apache Spark

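from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import rank

spark = SparkSession.builder.getOrCreate()
rows = [('Alice', 'HR', 3000), ('Bob', 'HR', 4000), ('Charlie', 'IT', 3500)]
df = spark.createDataFrame(rows, ['name', 'department', 'salary'])

# Rank employees by ascending salary within each department
windowSpec = Window.partitionBy('department').orderBy('salary')
df.withColumn('rank', rank().over(windowSpec)).show()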

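This code ranks employees by salary within each department. Because orderBy('salary') sorts in ascending order, the lowest salary in each department gets rank 1.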

Execution Table

Step	Action	Input Data	Window Spec	Function Applied	Output Column	Result
1	Start with DataFrame	[{'name':'Alice','department':'HR','salary':3000}, {'name':'Bob','department':'HR','salary':4000}, {'name':'Charlie','department':'IT','salary':3500}]	partitionBy('department').orderBy('salary')	rank()	rank	DataFrame ready for ranking
2	Partition data by department	Same as input	HR: [Alice, Bob], IT: [Charlie]	rank()	rank	Partitions created
3	Order each partition by salary ascending	HR: [Alice(3000), Bob(4000)], IT: [Charlie(3500)]	Same	rank()	rank	Ordered partitions
4	Apply rank() over each partition	Same	Same	rank()	rank	HR: Alice=1, Bob=2; IT: Charlie=1
5	Add rank column to DataFrame	Original rows	Same	rank()	rank	[{'name': 'Alice', 'department': 'HR', 'salary': 3000, 'rank': 1}, {'name': 'Bob', 'department': 'HR', 'salary': 4000, 'rank': 2}, {'name': 'Charlie', 'department': 'IT', 'salary': 3500, 'rank': 1}]
6	Show final DataFrame	Same	Same	rank()	rank	Displayed with rank column
7	End	Final DataFrame	Same	None	None	Execution complete

💡 All rows processed with rank assigned per partition; execution ends.
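The steps in the table can be traced by hand with a small plain-Python sketch. This is an emulation for illustration only, not how Spark executes the query; the helper name `rank_over` is hypothetical. It partitions, orders, and then assigns ranks so that tied values share a rank and the next distinct value jumps to its position, which is the behaviour rank() is documented to have.

```python
from itertools import groupby

# Input rows from step 1 of the execution table.
rows = [
    {"name": "Alice", "department": "HR", "salary": 3000},
    {"name": "Bob", "department": "HR", "salary": 4000},
    {"name": "Charlie", "department": "IT", "salary": 3500},
]


def rank_over(rows, partition_key, order_key):
    """Return copies of the rows with a 'rank' column computed per partition."""
    # Steps 2 and 3: sort so rows of one partition are adjacent and ordered.
    ordered = sorted(rows, key=lambda r: (r[partition_key], r[order_key]))
    result = []
    for _, partition in groupby(ordered, key=lambda r: r[partition_key]):
        prev_value, current_rank = None, 0
        # Step 4: the position restarts at 1 for every partition.
        for position, row in enumerate(partition, start=1):
            if position == 1 or row[order_key] != prev_value:
                current_rank = position  # new value: rank equals its position
            prev_value = row[order_key]
            # Step 5: keep the original columns and add the rank.
            result.append({**row, "rank": current_rank})
    return result


ranked = rank_over(rows, "department", "salary")
for row in ranked:
    print(row["department"], row["name"], row["salary"], row["rank"])
```

For the three sample rows this gives Alice=1 and Bob=2 in HR and Charlie=1 in IT, matching step 4 of the table.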

Variable Tracker

Variable	Start	After Step 2	After Step 3	After Step 4	After Step 5	Final
df	Original DataFrame	Partitioned by department	Partitions ordered by salary	Rank computed per partition	Rank column added	DataFrame with rank column
windowSpec	Not defined	Defined as partitionBy department, orderBy salary	Same	Same	Same	Same
rank	Not defined	Defined as rank() function	Same	Applied over windowSpec	Same	Same

💡 The df row describes the intermediate result at each step. df itself is never modified: withColumn returns a new DataFrame.

Key Moments - 3 Insights

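Why does the rank restart for each department?
Because the window spec uses partitionBy('department'). rank() is computed independently inside each partition, so numbering starts again at 1 for HR and for IT (see step 4 of the Execution Table).

Does the window function reduce the number of rows?
No. Every input row appears in the output; the function only adds the rank column (see step 5 of the Execution Table).

What happens if we don't specify orderBy in the windowSpec?
rank() needs an ordering to decide which row comes first. Spark rejects a ranking function applied over a window with no ordering, so orderBy must be specified.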

Visual Quiz - 3 Questions

Test your understanding

Look at step 4 of the Execution Table. What is the rank of Bob in the HR department?

A. 1

B. 2

C. 3

D. 0

Concept Snapshot

Window functions compute values across rows related to the current row.
Use a window specification (Window.partitionBy(...).orderBy(...)) to define the partitions and the row order within them.
Apply functions like rank(), row_number(), sum() over the window.
They add columns without reducing rows.
Useful for running totals, rankings, and moving averages.
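
As a rough illustration of the running-total use case, the plain-Python sketch below accumulates salaries row by row within each department. It is an emulation, not Spark code, and the helper name `running_total` is hypothetical. It assumes a row-based frame from the start of the partition up to the current row, which in Spark would roughly correspond to applying sum() over a window spec with rowsBetween(Window.unboundedPreceding, Window.currentRow).

```python
from itertools import accumulate, groupby

# Sample rows used throughout this walkthrough.
rows = [
    {"name": "Alice", "department": "HR", "salary": 3000},
    {"name": "Bob", "department": "HR", "salary": 4000},
    {"name": "Charlie", "department": "IT", "salary": 3500},
]


def running_total(rows, partition_key, order_key, value_key):
    """Return copies of the rows with a per-partition 'running_total' column."""
    # Sort so that rows of one partition are adjacent and ordered.
    ordered = sorted(rows, key=lambda r: (r[partition_key], r[order_key]))
    result = []
    for _, partition in groupby(ordered, key=lambda r: r[partition_key]):
        partition = list(partition)
        # accumulate yields the sum of all values up to and including each row.
        totals = accumulate(r[value_key] for r in partition)
        result.extend(
            {**row, "running_total": total}
            for row, total in zip(partition, totals)
        )
    return result


with_totals = running_total(rows, "department", "salary", "salary")
for row in with_totals:
    print(row["department"], row["name"], row["running_total"])
```

As with rank(), the total restarts in each department and no rows are removed.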

Full Transcript

Window functions in Apache Spark let you perform calculations across sets of rows related to the current row. You start with a DataFrame, define a window specification that partitions data (like by department) and orders it (like by salary). Then you apply a window function such as rank() over this window. The function computes results per partition and adds a new column to the DataFrame without removing any rows. For example, ranking employees by salary within each department assigns ranks starting at 1 for each department. This process keeps all original data but enriches it with new insights. Key points include that partitioning controls where the function resets, ordering is necessary for ranking, and window functions do not reduce the number of rows.