Databricks is a platform built on Apache Spark. What is its main goal?
Think about what tasks Databricks helps teams do together on big data.
Databricks combines data engineering, data science, and machine learning in one platform, making it easier to collaborate and process big data using Apache Spark.
Consider this PySpark code run in a Databricks notebook:
data = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
df = spark.createDataFrame(data, ['id', 'fruit'])
df.filter(df.id > 1).count()
What number will this output?
data = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
df = spark.createDataFrame(data, ['id', 'fruit'])
df.filter(df.id > 1).count()
Count rows where id is greater than 1.
Rows with id 2 and 3 satisfy the condition, so count is 2.
Given a table sales with columns region and amount, what is the output of this query?
SELECT region, SUM(amount) AS total_sales FROM sales GROUP BY region ORDER BY total_sales DESC LIMIT 2
Look at the GROUP BY, ORDER BY, and LIMIT clauses.
The query groups sales by region, sums amounts, orders by total sales descending, and returns only the top 2 regions.
Look at this code snippet:
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['num', 'char'])
df.select('num' + 1).show()
What error will this code produce?
Consider how to add a number to a column in PySpark.
The expression 'num' + 1 is evaluated by Python before Spark sees it: it tries to add an int to the string literal 'num', raising a TypeError. To add 1 to the column's values, build a column expression instead, such as df.num + 1, df['num'] + 1, or col('num') + 1.
You have two large DataFrames in Databricks and want to join them efficiently. Which approach will best improve performance?
Think about how Spark handles joins with data size differences.
Broadcast join sends the smaller DataFrame to all worker nodes, avoiding expensive shuffles and speeding up the join.