
Select, filter, and where operations in Apache Spark

Introduction

We use select, filter, and where to pick specific columns and rows from a DataFrame, so we can focus on just the data that matters. Typical situations:

When you want to see only some columns from a big table.
When you want to find rows that meet certain conditions, like age > 30.
When cleaning data by removing unwanted rows.
When preparing data for analysis by selecting relevant information.
When combining data and you need to filter before joining.
Syntax
Apache Spark
df.select("column1", "column2")
df.filter(df["column1"] > value)
df.where("column1 > value")

select chooses columns.

filter and where do the same thing: pick rows by condition.

Examples
Selects only the 'name' and 'age' columns from the data.
Apache Spark
df.select("name", "age")
Filters rows where the 'age' column is greater than 25.
Apache Spark
df.filter(df["age"] > 25)
Filters rows where the 'salary' column is more than 50000 using a string condition.
Apache Spark
df.where("salary > 50000")
Sample Program

This program creates a small table of people with their age and salary. It then shows how to pick columns and filter rows using select, filter, and where.

Apache Spark
from pyspark.sql import SparkSession

# Start Spark session
spark = SparkSession.builder.appName("SelectFilterWhereExample").getOrCreate()

# Create sample data
data = [
    ("Alice", 30, 60000),
    ("Bob", 22, 40000),
    ("Charlie", 35, 70000),
    ("David", 28, 45000)
]

# Define columns
columns = ["name", "age", "salary"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Select name and salary columns
selected_df = df.select("name", "salary")

# Filter rows where age is greater than 25
filtered_df = df.filter(df["age"] > 25)

# Use where to filter rows where salary is greater than 50000
where_df = df.where("salary > 50000")

# Show results
print("Selected columns (name, salary):")
selected_df.show()

print("Filtered rows (age > 25):")
filtered_df.show()

print("Where rows (salary > 50000):")
where_df.show()

# Stop Spark session
spark.stop()
Important Notes

filter and where do the same job; use whichever you find easier.

select only changes columns, it does not filter rows.

Both filter and where accept either column expressions, like df["age"] > 25, or SQL condition strings, like "age > 25".

Summary

Select picks columns you want to keep.

Filter and where pick rows that match a condition.

Use these to focus your data for easier analysis.