
What is an RDD (Resilient Distributed Dataset) in Apache Spark

Introduction

An RDD is Spark's core data structure for storing and working with data spread across many computers. It lets you process big data in parallel and recovers automatically if some machines fail.

RDDs are a good fit in situations like these:

When you want to process large data that doesn't fit on one computer.
When you need to run the same operation on many pieces of data at once.
When you want your data processing to continue even if some computers fail.
When you want to control how data is split and shared across computers.
When you want to build data pipelines that can be reused and changed easily.
Syntax
Apache Spark
rdd = sc.parallelize([1, 2, 3, 4])

sc.parallelize() creates an RDD from a list.

RDDs are immutable, meaning you cannot change them once created, but you can make new RDDs from them.

Examples
Create an RDD from a small list of numbers.
Apache Spark
rdd = sc.parallelize([10, 20, 30])
Make a new RDD by doubling each number in the original RDD.
Apache Spark
rdd2 = rdd.map(lambda x: x * 2)
Make a new RDD with only numbers greater than 15.
Apache Spark
filtered_rdd = rdd.filter(lambda x: x > 15)
Sample Program

This program creates an RDD from a list of numbers, doubles each number, then keeps only those greater than 5. Finally, it collects and prints the results.

Apache Spark
from pyspark import SparkContext

sc = SparkContext('local', 'RDD Example')

# Create an RDD from a list
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Double each number
doubled = numbers.map(lambda x: x * 2)

# Filter numbers greater than 5
filtered = doubled.filter(lambda x: x > 5)

# Collect results to driver
result = filtered.collect()

print(result)

sc.stop()
Output
[6, 8, 10]
Important Notes

RDDs are the basic building blocks in Spark for distributed data processing.

Transformations on RDDs (like map() and filter()) are lazy; they run only when an action (like collect()) asks for results.

RDDs can be created from files, other RDDs, or existing collections.

Summary

RDDs store data across many computers safely and efficiently.

You can create new RDDs by transforming existing ones with simple functions.

RDDs help handle big data and keep your work running even if some parts fail.