
Why DataFrames are preferred over RDDs in Apache Spark

Introduction

DataFrames are easier to use and usually faster than RDDs for working with structured big data. They let you write less code, and Spark can optimize DataFrame operations automatically.

When you want to work with structured data like tables with rows and columns.
When you need faster data processing with built-in optimizations.
When you want to use SQL queries on your data easily.
When you want to avoid writing complex code for data transformations.
When you want automatic handling of data types and schema.
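The last point above, automatic handling of data types and schema, can be sketched as follows. This is a minimal example assuming a hypothetical file people.csv with a header row and columns name and age; the inferSchema option asks Spark to detect the column types for you.

```scala
import org.apache.spark.sql.SparkSession

object SchemaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SchemaExample").master("local").getOrCreate()

    // inferSchema tells Spark to detect column types (e.g. age as an integer)
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("people.csv") // hypothetical file with columns: name, age

    // Prints the detected column names and types
    df.printSchema()
    spark.stop()
  }
}
```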
Syntax
Apache Spark
val df = spark.read.format("csv").option("header", "true").load("data.csv")
DataFrames have rows and columns like a spreadsheet or database table.
You can use SQL-like commands to query DataFrames.
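For example, you can register a DataFrame as a temporary view and query it with plain SQL. This sketch assumes the df loaded above, an active SparkSession named spark, and illustrative column names (name, age); the view name "people" is arbitrary.

```scala
// Register the DataFrame as a temporary view, then query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 28").show()
```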
Examples
An RDD loads the file as plain text lines, while a DataFrame loads it as rows that Spark can query by column.
Apache Spark
val rdd = spark.sparkContext.textFile("data.txt")
val df = spark.read.text("data.txt")
DataFrame lets you select columns easily and display data.
Apache Spark
df.select("name", "age").show()
RDD requires manual parsing and filtering, which is more complex.
Apache Spark
// Split each CSV line by hand and keep rows whose third field is "USA"
rdd.map(line => line.split(",")).filter(arr => arr(2) == "USA")
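For comparison, the same filter on a DataFrame is one readable line, with no manual splitting or positional indexing. This sketch assumes a hypothetical data.csv whose header includes a country column.

```scala
// Load the CSV with its header so columns can be referenced by name
val df = spark.read.option("header", "true").csv("data.csv")

// Filter by column name instead of array position
df.filter(df("country") === "USA").show()
```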
Sample Program

This program filters the same data with an RDD and with a DataFrame. The DataFrame code is simpler, and show() formats the output as a table.

Apache Spark
import org.apache.spark.sql.SparkSession

object DataFrameVsRDD {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("DataFrameVsRDD").master("local").getOrCreate()
    val sc = spark.sparkContext

    // Create RDD from a list
    val rdd = sc.parallelize(Seq((1, "Alice", 29), (2, "Bob", 31), (3, "Cathy", 25)))

    // Filter RDD for age > 28
    val rddFiltered = rdd.filter(_._3 > 28)
    println("RDD filtered results:")
    rddFiltered.collect().foreach(println)

    // Create DataFrame from the same data
    import spark.implicits._
    val df = rdd.toDF("id", "name", "age")

    // Filter DataFrame for age > 28
    val dfFiltered = df.filter("age > 28")
    println("DataFrame filtered results:")
    dfFiltered.show()

    spark.stop()
  }
}
Important Notes

DataFrames use a schema to understand data types, making operations faster.

RDDs are low-level and require more code for simple tasks.

DataFrames support many built-in functions and SQL queries.
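The built-in functions mentioned above replace hand-written aggregation code. This is a sketch assuming a DataFrame df with hypothetical country and age columns; avg and count come from Spark's built-in function library.

```scala
import org.apache.spark.sql.functions.{avg, count}

// Average age and row count per country, using built-in aggregate functions
df.groupBy("country")
  .agg(avg("age").alias("avg_age"), count("*").alias("people"))
  .show()
```

With an RDD, the same aggregation would require manually mapping to key-value pairs and combining them yourself.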

Summary

DataFrames are easier and faster for structured data.

They let you write less code and get better performance.

Use DataFrames when working with tables or SQL-like data.