Challenge - 5 Problems
Spark String Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of substring extraction in Spark
What is the output of this Spark code snippet that extracts a substring from a column?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import substring

spark = SparkSession.builder.getOrCreate()
data = [(1, "DataScience"), (2, "SparkFun"), (3, "Python")]
df = spark.createDataFrame(data, ["id", "word"])
df.select("id", substring("word", 2, 4).alias("sub_word")).orderBy("id").show()
Attempts: 2 left
💡 Hint
Remember that substring in Spark starts at position 1 and length is the number of characters to extract.
✗ Incorrect
The substring function extracts 4 characters starting from position 2 (1-based index). For 'DataScience', characters 2 to 5 are 'ataS'. For 'SparkFun', characters 2 to 5 are 'park'. For 'Python', characters 2 to 5 are 'ytho'.
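The 1-based indexing can be cross-checked with plain Python slicing. This sketch assumes no Spark session; the helper `spark_substring` is a hypothetical pure-Python model, not a Spark API, mapping Spark's substring(col, pos, len) to s[pos-1 : pos-1+len]:

```python
# Spark's substring(col, pos, len) is 1-based; Python slicing is 0-based.
# Hypothetical pure-Python model of the same semantics (not a Spark API):
def spark_substring(s: str, pos: int, length: int) -> str:
    return s[pos - 1 : pos - 1 + length]

# Reproduces the rows in the explanation above.
for word, expected in [("DataScience", "ataS"), ("SparkFun", "park"), ("Python", "ytho")]:
    assert spark_substring(word, 2, 4) == expected
```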
❓ Data Output
Intermediate · 2:00 remaining
Result of trimming spaces in Spark DataFrame
Given this Spark DataFrame, what is the output after applying the trim function to the 'text' column?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim

spark = SparkSession.builder.getOrCreate()
data = [(1, " hello "), (2, " world"), (3, "spark ")]
df = spark.createDataFrame(data, ["id", "text"])
df.select("id", trim("text").alias("trimmed_text")).orderBy("id").show()
Attempts: 2 left
💡 Hint
The trim function removes spaces from both ends of the string.
✗ Incorrect
The trim function removes leading and trailing spaces. So ' hello ' becomes 'hello', ' world' becomes 'world', and 'spark ' becomes 'spark'.
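The same result can be sanity-checked with Python's str.strip(). Note this is only an analogy for the space-padded data above: strip() removes all leading/trailing whitespace, while Spark's trim removes space characters.

```python
# Python analogue of Spark's trim() for the space-padded rows above.
rows = [(1, " hello "), (2, " world"), (3, "spark ")]
trimmed = [(i, t.strip()) for i, t in rows]
assert trimmed == [(1, "hello"), (2, "world"), (3, "spark")]
```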
🔧 Debug
Advanced · 2:00 remaining
Identify the error in Spark string concatenation
Does this Spark code raise an error when it concatenates two string columns passed by name?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat

spark = SparkSession.builder.getOrCreate()
data = [(1, "Hello", "World"), (2, "Spark", "Fun")]
df = spark.createDataFrame(data, ["id", "col1", "col2"])
df.select("id", concat("col1", "col2").alias("greeting")).show()
Attempts: 2 left
💡 Hint
Check which argument types the concat function accepts in Spark's Python API versus its Scala API.
✗ Incorrect
In PySpark, concat accepts both Column objects and plain column-name strings, so this snippet runs without error and produces 'HelloWorld' and 'SparkFun'. It is the Scala/Java API whose concat requires Column arguments (e.g. concat(col("col1"), col("col2"))); passing raw strings there fails at compile time, not with a runtime TypeError.
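A pure-Python sketch of what the PySpark call computes per row (no Spark session assumed), useful for checking the expected column values:

```python
# Row-wise analogue of concat("col1", "col2") from the snippet above.
rows = [(1, "Hello", "World"), (2, "Spark", "Fun")]
greetings = [(i, c1 + c2) for i, c1, c2 in rows]
assert greetings == [(1, "HelloWorld"), (2, "SparkFun")]
```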
🚀 Application
Advanced · 2:00 remaining
Using regexp_replace to clean data
You want to remove all digits from the 'info' column in a Spark DataFrame. Which code snippet correctly does this?
Attempts: 2 left
💡 Hint
Digits are represented by \d or [0-9] in regex, but consider the difference between \d and \D.
✗ Incorrect
The correct option uses '[0-9]+', which matches one or more digits and replaces them with an empty string, effectively removing all digits. One distractor uses '\d', which is a valid digit pattern but whose backslash must be escaped properly in a Python string literal (e.g. '\\d' or a raw string r'\d'). Another distractor removes letters, not digits, and the last uses '\D', which removes non-digits, the opposite of what is wanted.
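Spark's regexp_replace uses Java regex syntax, but the digit-stripping patterns discussed above behave the same in Python's re module, so they can be checked quickly on a sample string (the value "abc123def45" is an illustrative assumption, not from the question):

```python
import re

info = "abc123def45"  # hypothetical sample value for the 'info' column
assert re.sub(r"[0-9]+", "", info) == "abcdef"    # removes digits (the intended result)
assert re.sub(r"\d", "", info) == "abcdef"        # also removes digits, once escaped properly
assert re.sub(r"[a-zA-Z]+", "", info) == "12345"  # removes letters instead
assert re.sub(r"\D", "", info) == "12345"         # removes non-digits: the opposite
```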
🧠 Conceptual
Expert · 2:00 remaining
Understanding Spark string function behavior with nulls
What is the result of applying the Spark function upper() to a column containing null values?
Attempts: 2 left
💡 Hint
Think about how Spark functions handle null inputs generally.
✗ Incorrect
Spark string functions like upper() return null if the input is null. They do not raise errors or convert nulls to empty strings.
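This null-propagation rule can be modeled in plain Python; the helper below is a hypothetical sketch of how upper() behaves on a nullable column, not a Spark API:

```python
# Model of Spark's null semantics: upper(NULL) is NULL, never an error.
def spark_upper(value):
    return None if value is None else value.upper()

assert spark_upper("spark") == "SPARK"
assert spark_upper(None) is None  # null in, null out; no exception raised
```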