Which of the following best describes the main role of the Catalyst optimizer in Apache Spark?
Think about what happens between writing a query and running it efficiently.
The Catalyst optimizer takes the analyzed logical plan of a query and applies optimization rules to rewrite it before physical planning and execution, improving performance.
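The rule-based rewriting described above can be sketched in plain Scala. This is a toy model, not Spark's actual Catalyst API: the `Plan` node types and the `combineFilters` rule (which collapses two stacked filters into one conjunctive filter) are illustrative names invented here.

```scala
// Toy sketch of Catalyst-style rule-based plan rewriting.
// All names are illustrative, not Spark's real classes.
sealed trait Plan
case class Scan(table: String) extends Plan
case class Filter(cond: String, child: Plan) extends Plan
case class Project(cols: Seq[String], child: Plan) extends Plan

// One rewrite rule: collapse two adjacent filters into a single
// conjunctive filter, mirroring how an optimizer simplifies a plan.
def combineFilters(p: Plan): Plan = p match {
  case Filter(c1, Filter(c2, child)) =>
    combineFilters(Filter(s"($c1) AND ($c2)", child))
  case Filter(c, child)     => Filter(c, combineFilters(child))
  case Project(cols, child) => Project(cols, combineFilters(child))
  case s: Scan              => s
}

val plan      = Filter("age > 21", Filter("age < 65", Scan("people")))
val optimized = combineFilters(plan)
```

A real optimizer applies many such rules repeatedly until the plan stops changing; this sketch shows a single rule applied once.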
Given the following Spark SQL code, what will be the output of df.explain(true) regarding the optimization?
val df = spark.read.json("people.json")
val filtered = df.filter("age > 21")
filtered.explain(true)
explain(true) prints the parsed, analyzed, and optimized logical plans as well as the physical plan.
The Catalyst optimizer pushes filters down toward the data source when possible, which is visible in the physical plan printed by explain(true) (for sources that support it, as PushedFilters in the scan node).
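Predicate pushdown itself can be illustrated with another toy rewrite: move a filter beneath a projection so the predicate is evaluated closer to the scan. The `Node` types and `pushDown` function are invented for this sketch and are not Spark's planner API.

```scala
// Toy sketch of predicate pushdown: swap a filter below a projection.
// Illustrative names only, not Spark's real classes.
sealed trait Node
case class Source(name: String) extends Node
case class Keep(cols: Seq[String], child: Node) extends Node // projection
case class Where(cond: String, child: Node) extends Node     // filter

def pushDown(n: Node): Node = n match {
  // Filter over a projection: push the filter toward the source.
  // Safe here because "age" survives the projection; a real optimizer
  // would verify the predicate's columns are still available.
  case Where(cond, Keep(cols, child)) => Keep(cols, pushDown(Where(cond, child)))
  case Where(cond, child)             => Where(cond, pushDown(child))
  case Keep(cols, child)              => Keep(cols, pushDown(child))
  case s: Source                      => s
}

val before = Where("age > 21", Keep(Seq("name", "age"), Source("people.json")))
val after  = pushDown(before)
```

After the rewrite, the filter sits directly above the source, which is the shape you see reflected in explain(true) output when pushdown fires.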
Consider a DataFrame df with columns name and age. After applying df.filter("age > 30").select("name"), what will be the schema of the resulting DataFrame?
Think about what columns are selected after filtering.
The filter keeps rows where age > 30, but the select keeps only the name column, so the resulting schema has only name.
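The schema behavior can be modeled very simply: a filter changes which rows pass but leaves the column set intact, while a select narrows the column set. Representing a schema as a plain list of column names (not Spark's StructType) for illustration:

```scala
// Toy model: schemas as column-name lists, not Spark's StructType.
type Schema = Seq[String]

// A filter changes rows, never columns.
def filterSchema(in: Schema): Schema = in

// A select keeps only the requested columns.
def selectSchema(in: Schema, cols: Seq[String]): Schema =
  cols.filter(in.contains)

val dfSchema  = Seq("name", "age")
val resulting = selectSchema(filterSchema(dfSchema), Seq("name"))
```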
What error will occur if a user tries to filter a DataFrame using a non-existent column like df.filter("salary > 50000") when salary does not exist?
Think about how Catalyst checks column names before running queries.
Catalyst performs analysis and throws an AnalysisException if a referenced column does not exist in the schema.
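The analysis step can be sketched as a resolver that checks each referenced column against the schema and fails fast before anything executes. `AnalysisError` below is a stand-in invented for this sketch, not Spark's actual AnalysisException.

```scala
// Toy sketch of Catalyst-style analysis: resolve a column reference
// against the schema before execution. AnalysisError is a stand-in
// for Spark's AnalysisException.
case class AnalysisError(msg: String) extends Exception(msg)

def resolveColumn(col: String, schema: Seq[String]): String =
  if (schema.contains(col)) col
  else throw AnalysisError(
    s"Column '$col' does not exist; available: ${schema.mkString(", ")}")

val schema = Seq("name", "age")
val message =
  try { resolveColumn("salary", schema); "no error" }
  catch { case AnalysisError(m) => m }
```

Because resolution happens during analysis, the error surfaces when the plan is built, not midway through a job.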
Given two DataFrames df1 and df2, both large, what does the Catalyst optimizer do when you write df1.join(df2, "id")?
Choose the most accurate description of Catalyst's behavior regarding join order and optimization.
Think about how query optimizers improve join performance.
When cost-based optimization is enabled and table statistics are available, Catalyst can reorder joins and choose join strategies (such as broadcasting the smaller side) that reduce data movement and improve speed.
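The core of a cost-based choice can be sketched as: given estimated sizes, build or broadcast the smaller input. The `Relation` type and `chooseBuildSide` function are illustrative inventions, not Spark's planner internals.

```scala
// Toy sketch of cost-based join planning: pick the smaller relation
// as the build/broadcast side. Illustrative names only.
case class Relation(name: String, estimatedRows: Long)

// Returns (buildSide, probeSide): build a hash table on (or broadcast)
// the smaller input, then stream the larger one past it.
def chooseBuildSide(left: Relation, right: Relation): (Relation, Relation) =
  if (left.estimatedRows <= right.estimatedRows) (left, right)
  else (right, left)

val df1 = Relation("df1", 1000000L)
val df2 = Relation("df2", 5000L)
val (build, probe) = chooseBuildSide(df1, df2)
```

Spark makes a similar decision automatically (e.g. broadcast hash join under the broadcast threshold), which is why accurate statistics matter for join performance.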