Apache Spark · ~20 mins

Delta Lake introduction in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output of Delta Lake table creation and query
What is the output of the following Apache Spark code using Delta Lake?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DeltaExample").getOrCreate()
data = [(1, "apple"), (2, "banana"), (3, "cherry")]
df = spark.createDataFrame(data, ["id", "fruit"])
df.write.format("delta").mode("overwrite").save("/tmp/delta-table")
df2 = spark.read.format("delta").load("/tmp/delta-table")
df2.filter(df2.id > 1).count()
A. 2
B. 0
C. 1
D. 3
💡 Hint
Count rows where id is greater than 1.
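The filter-and-count logic can be sanity-checked without a Spark cluster; a plain-Python sketch of the same predicate over the same rows:

```python
# Plain-Python equivalent of df2.filter(df2.id > 1).count()
# over the rows created in the problem above (no Spark required).
data = [(1, "apple"), (2, "banana"), (3, "cherry")]

count = sum(1 for row_id, _fruit in data if row_id > 1)
```
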
Data Output (intermediate)
Result of Delta Lake update operation
Given a Delta Lake table with data [(1, 10), (2, 20), (3, 30)] stored at '/tmp/delta-update', what is the content of the table after running this update code?
Apache Spark
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DeltaUpdate").getOrCreate()
data = [(1, 10), (2, 20), (3, 30)]
df = spark.createDataFrame(data, ["id", "value"])
df.write.format("delta").mode("overwrite").save("/tmp/delta-update")
deltaTable = DeltaTable.forPath(spark, "/tmp/delta-update")
deltaTable.update(condition = "id == 2", set = {"value": "value + 5"})
deltaTable.toDF().orderBy("id").collect()
A[(1, 10), (2, 20), (3, 35)]
B[(1, 10), (2, 25), (3, 30)]
C[(1, 15), (2, 20), (3, 30)]
D[(1, 10), (2, 5), (3, 30)]
💡 Hint
Only the row with id 2 is updated by adding 5 to its value.
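The conditional-update semantics can be mimicked in plain Python (no Spark needed), applying the set expression only to rows matching the condition:

```python
# Plain-Python equivalent of
#   deltaTable.update(condition="id == 2", set={"value": "value + 5"})
rows = [(1, 10), (2, 20), (3, 30)]

updated = [(i, v + 5) if i == 2 else (i, v) for i, v in rows]
```
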
🔧 Debug (advanced)
Identify the error in Delta Lake merge code
What error will this Delta Lake merge code produce?
Apache Spark
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DeltaMerge").getOrCreate()
data_target = [(1, "a"), (2, "b")]
data_source = [(2, "bb"), (3, "c")]
df_target = spark.createDataFrame(data_target, ["id", "value"])
df_source = spark.createDataFrame(data_source, ["id", "value"])
df_target.write.format("delta").mode("overwrite").save("/tmp/delta-merge")
deltaTable = DeltaTable.forPath(spark, "/tmp/delta-merge")
deltaTable.alias("t").merge(
    df_source.alias("s"),
    "t.id = s.id"
).whenMatchedUpdate(set = {"value": "s.value"})
.whenNotMatchedInsert(values = {"id": "s.id", "value": "s.value"})
.execute()
A. SyntaxError because the multi-line method chain is not wrapped in parentheses
B. No error, the merge executes successfully
C. AnalysisException due to a missing Delta Lake package
D. TypeError because set and values expect column expressions, not strings
💡 Hint
In Python, a method chain split across lines must be wrapped in parentheses (or use backslash continuation).
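For reference, the intended upsert semantics of the merge (update matched rows, insert unmatched ones) can be mimicked in plain Python over the same data:

```python
# Plain-Python sketch of MERGE upsert semantics:
# matched keys are updated, unmatched source keys are inserted.
target = {1: "a", 2: "b"}   # rows in the Delta table
source = {2: "bb", 3: "c"}  # incoming rows

for key, value in source.items():
    target[key] = value  # update on match, insert otherwise

result = sorted(target.items())
```
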
Visualization (advanced)
Visualize Delta Lake version history
Which option correctly describes how to visualize the version history of a Delta Lake table?
A. Use the 'DESCRIBE HISTORY delta.`/path/to/table`' SQL command to get version info and plot it with matplotlib
B. Run 'SHOW VERSIONS delta.`/path/to/table`' to get history and use seaborn to plot
C. Use 'SELECT * FROM delta.`/path/to/table` VERSION AS OF 0' to get history and plot with pandas
D. Run 'HISTORY delta.`/path/to/table`' command and plot with plotly
💡 Hint
Delta Lake supports DESCRIBE HISTORY command for version info.
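To make the shape of that history concrete: a plain-Python sketch of rows like those returned by `spark.sql("DESCRIBE HISTORY delta.`/path/to/table`")`. The column set here is abbreviated for illustration; the real output carries more columns (timestamp, operationParameters, and others):

```python
# Simplified rows mimicking DESCRIBE HISTORY output (abbreviated columns).
history = [
    {"version": 0, "operation": "WRITE"},
    {"version": 1, "operation": "UPDATE"},
    {"version": 2, "operation": "MERGE"},
]

# Pull out the version numbers you might put on a matplotlib x-axis.
versions = [row["version"] for row in history]
```
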
🧠 Conceptual (expert)
Understanding Delta Lake ACID guarantees
Which statement best explains how Delta Lake ensures ACID transactions on big data?
A. Delta Lake uses a single writer lock to prevent concurrent writes, ensuring durability.
B. Delta Lake relies on HDFS replication to guarantee atomic writes and consistency.
C. Delta Lake uses a transaction log that records all changes atomically and supports snapshot isolation for readers.
D. Delta Lake stores data in a relational database to enforce ACID properties.
💡 Hint
Think about how Delta Lake tracks changes and manages concurrent access.
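To make the transaction-log idea concrete: each commit to a Delta table is written as a numbered JSON file under the table's `_delta_log/` directory, with one action (such as `commitInfo` or `add`) per line. Below is a hand-written, simplified commit entry parsed with plain Python; real entries carry many more fields:

```python
import json

# Hand-written, simplified commit modeled on a _delta_log JSON file
# (e.g. _delta_log/00000000000000000001.json); field set abbreviated.
commit = "\n".join([
    '{"commitInfo": {"operation": "WRITE"}}',
    '{"add": {"path": "part-00000.snappy.parquet", "dataChange": true}}',
])

# Each line is one JSON action; readers replay these to build a table snapshot.
actions = [json.loads(line) for line in commit.splitlines()]
operations = [next(iter(action)) for action in actions]
```
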