Apache Spark · ~30 mins

Select, filter, and where operations in Apache Spark - Mini Project: Build & Apply

📖 Scenario: You work as a data analyst for a small online bookstore. You have a dataset of books with details like title, author, genre, and price. Your manager wants you to find books that are affordable and belong to a specific genre.
🎯 Goal: Build a Spark DataFrame with book data, set a price limit, filter books by genre and price using select, filter, and where operations, and display the filtered results.
📋 What You'll Learn
Create a Spark DataFrame with exact book data
Create a variable for the maximum price allowed
Use select to choose specific columns
Use filter or where to select books by genre and price
Print the final filtered DataFrame
💡 Why This Matters
🌍 Real World
Filtering and selecting data is a common task in data analysis to focus on relevant information, such as affordable books in a specific genre.
💼 Career
Data analysts and data scientists often use select, filter, and where operations in Spark to prepare and explore large datasets efficiently.
1
Create the books DataFrame
Create a Spark DataFrame called books_df using Spark's createDataFrame method with a list of dictionaries, containing these exact entries:
{'title': 'The Alchemist', 'author': 'Paulo Coelho', 'genre': 'Fiction', 'price': 10}
{'title': 'Deep Learning', 'author': 'Ian Goodfellow', 'genre': 'Education', 'price': 50}
{'title': 'Clean Code', 'author': 'Robert C. Martin', 'genre': 'Education', 'price': 40}
{'title': 'The Hobbit', 'author': 'J.R.R. Tolkien', 'genre': 'Fiction', 'price': 15}
{'title': 'Cooking 101', 'author': 'Jamie Oliver', 'genre': 'Cooking', 'price': 20}
Need a hint?

Use spark.createDataFrame with a list of dictionaries containing the exact book details.

2
Set the maximum price limit
Create a variable called max_price and set it to 20 to represent the maximum price allowed for books.
Need a hint?

Just create a variable named max_price and assign it the value 20.

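This step is a single plain Python assignment; no Spark API is involved:

```python
# Maximum price allowed for books; used as the filter threshold later.
max_price = 20
```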
3
Filter books by genre and price
Use filter or where to keep only books where genre is 'Fiction' and price is less than or equal to max_price, then use select to choose the columns 'title', 'author', and 'price'. Filter before selecting: once select drops the genre column, it can no longer be referenced in the condition. Save the result in a new DataFrame called filtered_books.
Need a hint?

Use where with (col('genre') == 'Fiction') & (col('price') <= max_price) first, then select('title', 'author', 'price').

4
Display the filtered books
Use print and filtered_books.show() to display the filtered books DataFrame.
Need a hint?

Use print('Filtered books:') and then filtered_books.show() to display the data.