Apache Spark · ~20 mins

Databricks platform overview in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️ Databricks Mastery Badge: get all challenges correct to earn this badge!
🧠 Conceptual · intermediate
What is the primary purpose of Databricks?

Databricks is a platform built on Apache Spark. What is its main goal?

A. To serve only as a cloud storage service for big data files.
B. To replace all SQL databases with a new query language.
C. To act as a visualization tool for business dashboards only.
D. To provide a unified environment for data engineering, data science, and machine learning.
💡 Hint

Think about what tasks Databricks helps teams do together on big data.

Predict Output · intermediate
What is the output of this Databricks notebook cell?

Consider this PySpark code run in a Databricks notebook:

data = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
df = spark.createDataFrame(data, ['id', 'fruit'])
df.filter(df.id > 1).count()

What number will this output?

A. 2
B. 3
C. 1
D. 0
💡 Hint

Count rows where id is greater than 1.
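Once you have committed to an answer, you can check the filter logic without a Spark cluster. The sketch below mimics the row filter in plain Python; it reproduces only the `id > 1` predicate and the row count, not Spark's actual execution:

```python
# Plain-Python sketch of df.filter(df.id > 1).count() (no Spark needed).
# Each tuple stands in for a row with columns (id, fruit).
data = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]

# Keep rows whose id column is greater than 1, then count the survivors.
count = sum(1 for row_id, fruit in data if row_id > 1)
print(count)
```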

Data Output · advanced
What does this Spark SQL query return in Databricks?

Given a table sales with columns region and amount, what is the output of this query?

SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region
ORDER BY total_sales DESC
LIMIT 2
A. The top 2 regions with the highest total sales amounts.
B. All regions sorted alphabetically with their total sales.
C. The total sales amount for all regions combined.
D. An error because LIMIT cannot be used with GROUP BY.
💡 Hint

Look at the GROUP BY, ORDER BY, and LIMIT clauses.
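The GROUP BY / ORDER BY / LIMIT combination here behaves the same as in standard SQL, so you can test the query's shape with Python's built-in SQLite module. The table contents below are made-up sample data for illustration only:

```python
import sqlite3

# Recreate a small sales table in in-memory SQLite to probe the query.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sales (region TEXT, amount REAL)')
conn.executemany('INSERT INTO sales VALUES (?, ?)', [
    ('EMEA', 100.0), ('EMEA', 50.0),   # EMEA total: 150.0
    ('APAC', 120.0),                   # APAC total: 120.0
    ('AMER', 40.0),                    # AMER total: 40.0
])

# Same query as in the question: per-region totals, highest first, top 2.
rows = conn.execute('''
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    ORDER BY total_sales DESC
    LIMIT 2
''').fetchall()
print(rows)
```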

🔧 Debug · advanced
Why does this Databricks PySpark code raise an error?

Look at this code snippet:

df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['num', 'char'])
df.select('num' + 1).show()

What error will this code produce?

A. SyntaxError: invalid syntax in select statement
B. TypeError: can only concatenate str (not "int") to str
C. AnalysisException: cannot resolve 'num + 1' due to data type mismatch
D. AttributeError: 'DataFrame' object has no attribute 'num'
💡 Hint

Consider how to add a number to a column in PySpark.
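One way to investigate: Python evaluates the argument expression `'num' + 1` before any PySpark API runs, so you can probe that piece in isolation, with no Spark session at all:

```python
# Python evaluates 'num' + 1 itself, before Spark sees it:
# it is a plain string plus an integer.
try:
    'num' + 1
    err = None
except TypeError as exc:
    err = exc

print(type(err).__name__, err)  # the failure comes from Python itself

# Idiomatic PySpark references the column object instead, e.g.:
#   df.select(df.num + 1)
#   from pyspark.sql.functions import col; df.select(col('num') + 1)
# (comments only here, since no Spark session is available).
```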

🚀 Application · expert
How do you optimize a large join operation in Databricks?

You have two large DataFrames in Databricks and want to join them efficiently. Which approach will best improve performance?

A. Write both DataFrames to CSV files and join using file system commands.
B. Convert both DataFrames to Pandas and join locally.
C. Use a broadcast join by broadcasting the smaller DataFrame to all nodes.
D. Use a nested loop join by collecting both DataFrames to the driver.
💡 Hint

Think about how Spark handles joins with data size differences.
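For background: a broadcast (map-side) join ships the small table to every worker, so each partition of the large table joins locally via hash lookup instead of shuffling. The idea can be sketched in plain Python; the column names and sample rows below are illustrative, not from the question:

```python
# Sketch of the broadcast-hash-join idea in plain Python.
# Small side: an id -> region lookup, the table Spark would broadcast.
small = {1: 'EMEA', 2: 'APAC'}

# Large side: a stream of (id, amount) rows that stays partitioned.
large = [(1, 100.0), (2, 50.0), (1, 25.0), (3, 10.0)]

# Each "worker" joins its rows against the broadcast dict locally;
# the large side is never shuffled. Unmatched ids are dropped (inner join).
joined = [(row_id, amount, small[row_id])
          for row_id, amount in large if row_id in small]
print(joined)
```

In PySpark the same intent is expressed with the broadcast hint, e.g. `from pyspark.sql.functions import broadcast` and then `large_df.join(broadcast(small_df), 'id')`.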