RDDs (Resilient Distributed Datasets) are Spark's core abstraction for working with data; they let you process collections in parallel across a cluster.
Creating RDDs from collections and files in Apache Spark
Introduction
Create an RDD when:
You have a small collection in your program and want to process it with Spark.
You want to read data from a file and analyze it with Spark's distributed engine.
You want to convert existing data collections into RDDs for parallel processing.
You want to load text data from files for tasks such as word counting or filtering.
You want to experiment with Spark on simple data before working with large files.
Syntax
Apache Spark
rdd = spark.sparkContext.parallelize(collection)
rdd = spark.sparkContext.textFile(file_path)
parallelize() creates an RDD from a local collection like a list.
textFile() reads a file and creates an RDD where each item is a line.
Examples
This creates an RDD from a list of numbers.
Apache Spark
numbers = [1, 2, 3, 4]
rdd_numbers = spark.sparkContext.parallelize(numbers)
This reads a text file and creates an RDD of lines.
Apache Spark
lines_rdd = spark.sparkContext.textFile("data/sample.txt")
Sample Program
This program shows how to create RDDs from a list and from a text file. It prints the contents of both RDDs.
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CreateRDDs").getOrCreate()

# Create RDD from a collection
fruits = ["apple", "banana", "cherry"]
rdd_fruits = spark.sparkContext.parallelize(fruits)
print("RDD from collection:")
print(rdd_fruits.collect())

# Create RDD from a text file
# For demonstration, create a small text file first
file_path = "/tmp/sample.txt"
with open(file_path, "w") as f:
    f.write("hello\nworld\nspark")

rdd_file = spark.sparkContext.textFile(file_path)
print("RDD from file lines:")
print(rdd_file.collect())

spark.stop()
Output
RDD from collection:
['apple', 'banana', 'cherry']
RDD from file lines:
['hello', 'world', 'spark']
Important Notes
Remember to call collect() to get all RDD items back to the driver for printing or inspection.
Creating RDDs from files is useful for big data, but for small data, collections are easier to use.
Always stop the Spark session when done to free resources.
Summary
Use parallelize() to create RDDs from local collections like lists.
Use textFile() to create RDDs from text files, where each line is an item.
RDDs make it easy to process data in parallel with Spark.