Databricks is a platform built on Apache Spark. What is its main goal?
Think about what tasks Databricks helps teams do together on big data.
Databricks combines data engineering, data science, and machine learning in one platform, making it easier to collaborate and process big data using Apache Spark.
Consider this PySpark code run in a Databricks notebook:
data = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
df = spark.createDataFrame(data, ['id', 'fruit'])
df.filter(df.id > 1).count()
What number will this output?
data = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
df = spark.createDataFrame(data, ['id', 'fruit'])
df.filter(df.id > 1).count()
Count rows where id is greater than 1.
Rows with id 2 and 3 satisfy the condition, so count is 2.
Given a table sales with columns region and amount, what is the output of this query?
SELECT region, SUM(amount) AS total_sales FROM sales GROUP BY region ORDER BY total_sales DESC LIMIT 2
Look at the GROUP BY, ORDER BY, and LIMIT clauses.
The query groups sales by region, sums amounts, orders by total sales descending, and returns only the top 2 regions.
Look at this code snippet:
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['num', 'char'])
df.select('num' + 1).show()
What error will this code produce?
Consider how to add a number to a column in PySpark.
The expression 'num' + 1 is evaluated by Python before Spark sees it: it tries to add an int to the string literal 'num', raising a TypeError. To add 1 to the column's values, build a column expression instead, such as df.num + 1, df['num'] + 1, or col('num') + 1.
You have two large DataFrames in Databricks and want to join them efficiently. Which approach will best improve performance?
Think about how Spark handles joins with data size differences.
Broadcast join sends the smaller DataFrame to all worker nodes, avoiding expensive shuffles and speeding up the join.