Given a Spark DataFrame df with columns id and value:
+---+-----+
|id |value|
+---+-----+
|1  |10   |
|2  |20   |
|3  |30   |
|4  |40   |
+---+-----+
What will be the result of this SQL query?
df.createOrReplaceTempView("table1")
spark.sql("SELECT id, value * 2 AS double_value FROM table1 WHERE id > 2 ORDER BY id").show()

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 10), (2, 20), (3, 30), (4, 40)]
df = spark.createDataFrame(data, ["id", "value"])
df.createOrReplaceTempView("table1")
result = spark.sql("SELECT id, value * 2 AS double_value FROM table1 WHERE id > 2 ORDER BY id")
result.show()
Remember the WHERE clause filters rows where id is greater than 2.
The WHERE clause keeps the rows with id greater than 2, which are ids 3 and 4. The SELECT then doubles each value: 30*2=60 and 40*2=80. The result is therefore the two rows (3, 60) and (4, 80), ordered by id.
Consider a Spark DataFrame df with the following data:
+---+--------+
|id |status  |
+---+--------+
|1  |active  |
|2  |inactive|
|3  |active  |
|4  |inactive|
|5  |active  |
+---+--------+
What is the number of rows returned by this SQL query?
df.createOrReplaceTempView("users")
spark.sql("SELECT * FROM users WHERE status = 'active'").count()

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, "active"), (2, "inactive"), (3, "active"), (4, "inactive"), (5, "active")]
df = spark.createDataFrame(data, ["id", "status"])
df.createOrReplaceTempView("users")
count_active = spark.sql("SELECT * FROM users WHERE status = 'active'").count()
Count only rows where status is exactly 'active'.
Rows with id 1, 3, and 5 have status 'active', so total count is 3.
Given a Spark DataFrame df with columns name and age, what error will this code produce?
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > '30'").show()

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [("Alice", 25), ("Bob", 35), ("Charlie", 30)]
df = spark.createDataFrame(data, ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > '30'").show()
Check how Spark SQL handles comparisons between integers and strings.
No error occurs. Spark SQL implicitly casts the string literal '30' for the numeric comparison, so the query runs normally and returns the rows with age greater than 30 (here, only Bob).
Given a Spark DataFrame sales with columns product and amount:
+--------+------+
|product |amount|
+--------+------+
|A       |100   |
|B       |200   |
|A       |150   |
|B       |50    |
|C       |300   |
+--------+------+
After running this SQL query and plotting a bar chart of total sales per product, which chart matches the data?
sales.createOrReplaceTempView("sales_table")
result = spark.sql("SELECT product, SUM(amount) AS total FROM sales_table GROUP BY product ORDER BY product")
result.show()

from pyspark.sql import SparkSession
import matplotlib.pyplot as plt

spark = SparkSession.builder.getOrCreate()
data = [("A", 100), ("B", 200), ("A", 150), ("B", 50), ("C", 300)]
sales = spark.createDataFrame(data, ["product", "amount"])
sales.createOrReplaceTempView("sales_table")
result = spark.sql("SELECT product, SUM(amount) AS total FROM sales_table GROUP BY product ORDER BY product")
pandas_df = result.toPandas()
plt.bar(pandas_df['product'], pandas_df['total'])
plt.xlabel('Product')
plt.ylabel('Total Sales')
plt.title('Total Sales per Product')
plt.show()
Sum amounts per product: add all amounts for each product.
Product A totals 100+150=250, B totals 200+50=250, and C totals 300. The matching chart therefore shows equal bars of 250 for A and B and a taller bar of 300 for C.
Suppose you join two Spark DataFrames df1 and df2 on a key column, and both have a column named value. What happens to the value columns in the resulting DataFrame?
Think about how Spark handles columns with the same name after join.
Spark does not rename or suffix duplicate columns: the joined DataFrame simply contains two columns, both named value, one from each input. The join itself succeeds, but referencing value by its bare name afterwards (e.g. in select or filter) raises an AnalysisException for an ambiguous column reference. To disambiguate, qualify the column through its parent DataFrame (df1["value"]) or rename one side before or after the join.