Which of the following is a key reason why DataFrames are preferred over RDDs in Apache Spark?
Think about how Spark optimizes query execution for structured data.
DataFrames carry a schema and are executed through Spark's Catalyst optimizer, which plans and optimizes each query before it runs. RDDs expose no schema to Spark, so their operations cannot be optimized this way, making DataFrames faster and more efficient for structured data.
What will be the output of the following Spark code snippet?
rdd = spark.sparkContext.parallelize([(1,), (2,), (3,), (4,)])
df = rdd.toDF(['numbers'])
count_rdd = rdd.count()
count_df = df.count()
print(count_rdd, count_df)
Both RDD and DataFrame represent the same data here.
The RDD and the DataFrame built from it contain the same 4 rows, so both counts are 4 and the output is "4 4".
Given the following code, what is the schema of the DataFrame?
data = [(1, 'Alice', 29), (2, 'Bob', 31)]
df = spark.createDataFrame(data, schema=['id', 'name', 'age'])
df.printSchema()
Check the default data types inferred by Spark for integers and strings.
Spark infers Python integers as long (64-bit) and Python strings as string, with every column nullable = true. The printed schema is therefore id: long, name: string, age: long.
What error will occur when running this code?
rdd = spark.sparkContext.parallelize([(1, 'a'), (2, 'b')])
rdd.select('1').show()
Consider which Spark objects support the select() method.
select() is a DataFrame method; RDDs do not define it. Calling select() on an RDD therefore raises an AttributeError before any Spark job runs.
You have a large dataset and want to calculate the average age grouped by city. Which approach will be faster and why?
Think about how Spark optimizes structured data operations.
DataFrame aggregations run through Spark's Catalyst optimizer and the Tungsten execution engine, which optimize the grouping and the average, so a DataFrame groupBy is typically faster than hand-written RDD map/reduce operations.