Apache Spark · Concept · Beginner · 3 min read

What is RDD in Spark: Definition, Example, and Use Cases

RDD stands for Resilient Distributed Dataset in Apache Spark. It is a fundamental data structure that represents an immutable, distributed collection of objects that can be processed in parallel across a cluster.
⚙️ How It Works

Think of an RDD as a big collection of data split into chunks, spread across many computers. Each chunk can be worked on at the same time, which makes processing large data very fast. If one computer fails, Spark can rebuild the lost data using the original instructions, so nothing is lost.

This is like having a recipe for a cake and several friends each baking a part of it in their own kitchen. If one friend’s cake burns, you can use the recipe to make that part again without starting from scratch. This makes RDDs reliable and efficient for big data tasks.

💻 Example

This example shows how to create an RDD from a list of numbers, filter it to keep only the even numbers, and collect the results.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDD Example").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a list
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])

# Keep only the even numbers
even_numbers = numbers.filter(lambda x: x % 2 == 0)

# Collect results back to the driver
result = even_numbers.collect()
print(result)

spark.stop()
```

Output

```
[2, 4, 6]
```
🎯 When to Use

Use RDDs when you need fine control over your data processing or when working with unstructured data. They are great for low-level transformations and actions where you want to control how data is partitioned and processed.

For example, if you are processing logs, sensor data, or performing custom computations that don’t fit well into higher-level APIs like DataFrames, RDDs give you the flexibility and fault tolerance you need.

Key Points

  • Immutable: Once created, RDDs cannot be changed.
  • Distributed: Data is split across many machines for parallel processing.
  • Fault-tolerant: Can recover lost data automatically.
  • Lazy evaluation: Computations are only done when needed.
  • Low-level API: Gives control over data processing steps.

Key Takeaways

  • RDD is a core Spark data structure for distributed, fault-tolerant data processing.
  • It splits data across machines and processes it in parallel for speed.
  • RDDs are immutable and support lazy evaluation for efficiency.
  • Use RDDs when you need fine control or work with unstructured data.
  • Spark can rebuild lost data automatically if a node fails.