
Creating RDDs from collections and files in Apache Spark - Try It Yourself

📖 Scenario: You are working with Apache Spark to process data. You want to learn how to create RDDs (Resilient Distributed Datasets) from simple in-memory collections and from files stored on disk. This is a foundational skill for data processing in Spark.
🎯 Goal: Build a Spark program that creates RDDs from a Python list and from a text file, so you can later process data in Spark.
📋 What You'll Learn
Create an RDD from a Python list using sc.parallelize()
Create an RDD from a text file using sc.textFile()
Use the SparkContext variable named sc
Print the contents of both RDDs
💡 Why This Matters
🌍 Real World
Creating RDDs from collections and files is the first step in processing big data with Apache Spark. It lets you load data from memory or disk to start analysis.
💼 Career
Data engineers and data scientists use these skills to prepare data for distributed processing and analysis in Spark environments.
1
Create an RDD from a Python list
Create a Python list called data_list with these exact values: "apple", "banana", "cherry". Then create an RDD called rdd_from_list by using sc.parallelize(data_list).
Hint: Use sc.parallelize() to convert a Python list into an RDD.

2
Create an RDD from a text file
Create a variable called file_path and set it to the string "data/sample.txt". Then create an RDD called rdd_from_file by using sc.textFile(file_path).
Hint: Use sc.textFile() to read a text file into an RDD.

3
Collect data from both RDDs
Create two variables: list_data and file_data. Assign rdd_from_list.collect() to list_data and rdd_from_file.collect() to file_data.
Hint: Use the collect() method to get all elements of an RDD as a list.

4
Print the collected data
Print the variables list_data and file_data each on its own line using two separate print() statements.
Hint: Use print() to display the collected lists.