Apache Spark · ~10 mins

Select, filter, and where operations in Apache Spark - Step-by-Step Execution

Concept Flow - Select, filter, and where operations
Start with DataFrame → Select columns → Apply filter/where condition → Result DataFrame with rows matching the condition → End
Start with a DataFrame, select columns to keep, then filter rows using conditions to get the final result.
Execution Sample
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point for DataFrame operations.
spark = SparkSession.builder.getOrCreate()

# Three rows of (id, name, age).
data = [(1, 'Alice', 20), (2, 'Bob', 30), (3, 'Cathy', 25)]
df = spark.createDataFrame(data, ['id', 'name', 'age'])

# Keep two columns, then keep rows where age > 20.
df.select('name', 'age').filter(df.age > 20).show()
Create a DataFrame, select 'name' and 'age' columns, then filter rows where age is greater than 20.
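As a rough pure-Python analogy (illustrative only, not PySpark itself), the select-then-filter chain behaves like projecting and then filtering a list of dictionaries:

```python
# Pure-Python sketch of the same pipeline (assumed sample data, not Spark).
rows = [
    {'id': 1, 'name': 'Alice', 'age': 20},
    {'id': 2, 'name': 'Bob', 'age': 30},
    {'id': 3, 'name': 'Cathy', 'age': 25},
]

# "select('name', 'age')": keep only the two columns.
selected = [{'name': r['name'], 'age': r['age']} for r in rows]

# "filter(df.age > 20)": keep rows where the condition holds.
result = [r for r in selected if r['age'] > 20]

print(result)  # [{'name': 'Bob', 'age': 30}, {'name': 'Cathy', 'age': 25}]
```

Note that age 20 is excluded: the condition is strictly greater than 20.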
Execution Table
| Step | Action | DataFrame State | Output |
|---|---|---|---|
| 1 | Create DataFrame | [{id:1,name:'Alice',age:20}, {id:2,name:'Bob',age:30}, {id:3,name:'Cathy',age:25}] | Full DataFrame with 3 rows |
| 2 | Select columns 'name' and 'age' | [{name:'Alice',age:20}, {name:'Bob',age:30}, {name:'Cathy',age:25}] | DataFrame with 2 columns, 3 rows |
| 3 | Filter rows where age > 20 | [{name:'Bob',age:30}, {name:'Cathy',age:25}] | Filtered DataFrame with 2 rows |
| 4 | Show result | Print rows | name: Bob, age: 30; name: Cathy, age: 25 |
| 5 | End | No further action | Execution complete |
💡 Filtering evaluates the condition on every row; only rows with age > 20 remain.
Variable Tracker
| Variable | Start | After Step 1 | After Step 2 | After Step 3 | Final |
|---|---|---|---|---|---|
| df (current result in the chain) | None | [{id:1,name:'Alice',age:20}, {id:2,name:'Bob',age:30}, {id:3,name:'Cathy',age:25}] | [{name:'Alice',age:20}, {name:'Bob',age:30}, {name:'Cathy',age:25}] | [{name:'Bob',age:30}, {name:'Cathy',age:25}] | [{name:'Bob',age:30}, {name:'Cathy',age:25}] |
Key Moments - 3 Insights
Why does the filter keep only rows where age is greater than 20?
Because the filter condition df.age > 20 is evaluated against each row, only rows satisfying it remain, as shown in step 3 of the Execution Table.
What happens if we select columns after filtering instead of before?
Selecting columns after filtering works equally well here: the final output is identical, and only the intermediate DataFrame states differ, as seen in the Variable Tracker. (Spark's Catalyst optimizer often reorders projections and filters internally anyway.)
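A hypothetical pure-Python sketch of the two orderings (assumed helper names `select` and `keep`, not PySpark API), showing they yield the same final rows on this data:

```python
rows = [
    {'id': 1, 'name': 'Alice', 'age': 20},
    {'id': 2, 'name': 'Bob', 'age': 30},
    {'id': 3, 'name': 'Cathy', 'age': 25},
]

def select(rs, *cols):
    # Keep only the named columns, like DataFrame.select().
    return [{c: r[c] for c in cols} for r in rs]

def keep(rs, pred):
    # Keep rows matching the predicate, like DataFrame.filter().
    return [r for r in rs if pred(r)]

select_then_filter = keep(select(rows, 'name', 'age'), lambda r: r['age'] > 20)
filter_then_select = select(keep(rows, lambda r: r['age'] > 20), 'name', 'age')

assert select_then_filter == filter_then_select  # same final rows
```

The equivalence holds here because the filter condition only uses a column that survives the selection; filtering on a dropped column (e.g. id) would require filtering first.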
Are 'filter' and 'where' different in Spark DataFrame?
No; in Spark, where() is simply an alias for filter(). Both apply the same row condition and return the same result, as shown in the code example.
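A minimal pure-Python sketch of how such an alias behaves (a toy `MiniFrame` class, not PySpark's implementation):

```python
class MiniFrame:
    """Toy stand-in for a DataFrame (illustrative only, not PySpark)."""
    def __init__(self, rows):
        self.rows = rows

    def filter(self, pred):
        # Return a NEW frame; the original is left untouched.
        return MiniFrame([r for r in self.rows if pred(r)])

    where = filter  # 'where' is just another name for 'filter'

df = MiniFrame([{'name': 'Bob', 'age': 30}, {'name': 'Cathy', 'age': 25}])
a = df.filter(lambda r: r['age'] > 26)
b = df.where(lambda r: r['age'] > 26)
assert a.rows == b.rows == [{'name': 'Bob', 'age': 30}]
```

Because the two names are bound to the same function, choosing between them in Spark is purely a matter of readability (where() reads naturally after SQL-style code).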
Visual Quiz - 3 Questions
Test your understanding
Looking at step 3 of the Execution Table, how many rows remain after filtering?
A1 row
B2 rows
C3 rows
D0 rows
💡 Hint
Check the 'DataFrame State' column at step 3 of the Execution Table.
At which step are columns 'name' and 'age' selected?
AStep 1
BStep 3
CStep 2
DStep 4
💡 Hint
Look at the 'Action' column in the Execution Table for the column-selection step.
If the filter condition were changed to df.age > 25, how many rows would remain after filtering?
A1 row
B2 rows
C3 rows
D0 rows
💡 Hint
Refer to the Variable Tracker and consider which rows have age greater than 25.
Concept Snapshot
Select, filter, and where operations in Spark:
- Use select() to choose columns.
- Use filter() or where() to keep rows matching a condition.
- Both filter and where do the same filtering.
- Operations return new DataFrames, original unchanged.
- Chain select and filter for precise data extraction.
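The "original unchanged" point from the snapshot can be sketched in plain Python terms (an analogy mirroring Spark's behavior, not Spark itself):

```python
# Sketch: transformations build new collections; the source stays intact.
original = [
    {'id': 1, 'name': 'Alice', 'age': 20},
    {'id': 2, 'name': 'Bob', 'age': 30},
    {'id': 3, 'name': 'Cathy', 'age': 25},
]

extracted = [
    {'name': r['name'], 'age': r['age']}   # select()
    for r in original
    if r['age'] > 20                        # filter() / where()
]

assert len(original) == 3                   # source untouched
assert extracted == [{'name': 'Bob', 'age': 30}, {'name': 'Cathy', 'age': 25}]
```

In real Spark the same holds: select() and filter() each return a new DataFrame, so df can be reused after the chain runs.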
Full Transcript
This visual execution shows how to use select, filter, and where operations in Apache Spark DataFrames. We start by creating a DataFrame with three rows and three columns: id, name, and age. Then we select only the 'name' and 'age' columns, reducing the DataFrame to two columns but keeping all rows. Next, we apply a filter condition to keep only rows where age is greater than 20. This removes the row with age 20, leaving two rows. Finally, we show the result, which prints the filtered rows. The variable tracker shows how the DataFrame changes after each step. Key moments clarify common confusions about filtering and the equivalence of filter and where. The quiz tests understanding of row counts after filtering and the order of operations. The snapshot summarizes the key points for quick reference.