Apache Spark · ~30 mins

Why DataFrames are preferred over RDDs in Apache Spark - See It in Action

Why DataFrames are preferred over RDDs
📖 Scenario: Imagine you work with big data at a company. You have two ways to handle it: RDDs (Resilient Distributed Datasets) or DataFrames. You want to understand why DataFrames are usually the better choice.
🎯 Goal: You will create a small dataset, set a configuration variable that names a column, use DataFrame operations to filter and select data, and finally print the result. Along the way you'll see why DataFrames are easier to write and faster to run than RDDs.
📋 What You'll Learn
Create a Spark DataFrame from a list of tuples
Create a configuration variable to select a column name
Use DataFrame methods to filter rows where age is greater than 25
Select the configured column from the filtered DataFrame
Print the resulting DataFrame to show the output
💡 Why This Matters
🌍 Real World
DataFrames are widely used in big data companies to process and analyze large datasets efficiently.
💼 Career
Knowing why DataFrames are preferred helps you write better Spark code and improves your chances in data engineering and data science roles.
1
Create the initial DataFrame
Create a Spark DataFrame called df from the list data = [(1, 'Alice', 23), (2, 'Bob', 30), (3, 'Charlie', 25)] with columns ['id', 'name', 'age'].
Need a hint?

Use spark.createDataFrame() with the list data and the list columns.

2
Set the column to select
Create a variable called select_column and set it to the string 'name'.
Need a hint?

Just assign the string 'name' to the variable select_column.
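This step is plain Python; keeping the column name in one variable means a later change (say, to `'age'`) only touches one line:

```python
# Column to select later; stored as a config variable
# so the query below never hard-codes the column name
select_column = 'name'
```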

3
Filter and select data using DataFrame methods
Use DataFrame methods to create a new DataFrame called filtered_df by filtering df where age is greater than 25, then select the column stored in select_column.
Need a hint?

Use df.filter(df.age > 25) and then .select(select_column).

4
Print the filtered DataFrame
Use filtered_df.show() to print the filtered DataFrame.
Need a hint?

Use filtered_df.show() to display the DataFrame.