
Creating RDDs from collections and files in Apache Spark

Introduction

RDDs (Resilient Distributed Datasets) are Spark's core abstraction for distributed data. Creating an RDD is the first step toward processing data in parallel across a cluster.

Typical situations where you create an RDD:

When you have a small collection in your program and want to process it with Spark.
When you want to read data from a file and analyze it with Spark's distributed engine.
When you want to convert existing in-memory collections into RDDs for parallel processing.
When you want to load text data from files for tasks like word counting or filtering.
When you want to experiment with Spark on simple data before moving to large files.
Syntax
Apache Spark
rdd = spark.sparkContext.parallelize(collection)
rdd = spark.sparkContext.textFile(file_path)

parallelize() creates an RDD from a local collection like a list.

textFile() reads a file and creates an RDD where each item is a line.

Examples
This creates an RDD from a list of numbers.
Apache Spark
numbers = [1, 2, 3, 4]
rdd_numbers = spark.sparkContext.parallelize(numbers)
This reads a text file and creates an RDD of lines.
Apache Spark
lines_rdd = spark.sparkContext.textFile("data/sample.txt")
Sample Program

This program shows how to create RDDs from a list and from a text file. It prints the contents of both RDDs.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CreateRDDs").getOrCreate()

# Create RDD from a collection
fruits = ["apple", "banana", "cherry"]
rdd_fruits = spark.sparkContext.parallelize(fruits)

print("RDD from collection:")
print(rdd_fruits.collect())

# Create RDD from a text file
# For demonstration, create a small text file first
file_path = "/tmp/sample.txt"
with open(file_path, "w") as f:
    f.write("hello\nworld\nspark")

rdd_file = spark.sparkContext.textFile(file_path)

print("RDD from file lines:")
print(rdd_file.collect())

spark.stop()
Important Notes

Call collect() to bring RDD items back to the driver for printing or inspection, but be careful: collect() pulls the entire RDD into driver memory, which can fail on large datasets.

textFile() is the natural choice when your data already lives in files or distributed storage; for small in-memory data, parallelize() is simpler.

Always stop the Spark session when done to free resources.

Summary

Use parallelize() to create RDDs from local collections like lists.

Use textFile() to create RDDs from text files, where each line is an item.

RDDs let you process data in parallel with Spark easily.