Challenge - 5 Problems
Spark String Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of substring extraction in Spark
What is the output of this Spark code snippet that extracts a substring from a column?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import substring

spark = SparkSession.builder.getOrCreate()
data = [(1, "DataScience"), (2, "SparkFun"), (3, "Python")]
df = spark.createDataFrame(data, ["id", "word"])
df.select("id", substring("word", 2, 4).alias("sub_word")).orderBy("id").show()
Attempts: 2 left
💡 Hint
Remember that substring in Spark starts at position 1 and length is the number of characters to extract.
✗ Incorrect
The substring function extracts 4 characters starting from position 2 (1-based index). For 'DataScience', characters 2 to 5 are 'ataS'. For 'SparkFun', characters 2 to 5 are 'park'. For 'Python', characters 2 to 5 are 'ytho'.
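The 1-based indexing can be cross-checked with plain Python slicing. This sketch assumes no Spark session; the helper `spark_substring` is a hypothetical pure-Python model, not a Spark API, mapping Spark's substring(col, pos, len) to s[pos-1 : pos-1+len]:

```python
# Spark's substring(col, pos, len) is 1-based; Python slicing is 0-based.
# Hypothetical pure-Python model of the same semantics (not a Spark API):
def spark_substring(s: str, pos: int, length: int) -> str:
    return s[pos - 1 : pos - 1 + length]

# Reproduces the rows in the explanation above.
for word, expected in [("DataScience", "ataS"), ("SparkFun", "park"), ("Python", "ytho")]:
    assert spark_substring(word, 2, 4) == expected
```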
❓ Data Output
Intermediate · 2:00 remaining
Result of trimming spaces in Spark DataFrame
Given this Spark DataFrame, what is the output after applying the trim function to the 'text' column?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim

spark = SparkSession.builder.getOrCreate()
data = [(1, " hello "), (2, " world"), (3, "spark ")]
df = spark.createDataFrame(data, ["id", "text"])
df.select("id", trim("text").alias("trimmed_text")).orderBy("id").show()
Attempts: 2 left
💡 Hint
The trim function removes spaces from both ends of the string.
✗ Incorrect
The trim function removes leading and trailing spaces. So ' hello ' becomes 'hello', ' world' becomes 'world', and 'spark ' becomes 'spark'.
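The same result can be sanity-checked with Python's str.strip(). Note this is only an analogy for the space-padded data above: strip() removes all leading/trailing whitespace, while Spark's trim removes space characters.

```python
# Python analogue of Spark's trim() for the space-padded rows above.
rows = [(1, " hello "), (2, " world"), (3, "spark ")]
trimmed = [(i, t.strip()) for i, t in rows]
assert trimmed == [(1, "hello"), (2, "world"), (3, "spark")]
```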
🔧 Debug
Advanced · 2:00 remaining
Identify the error in Spark string concatenation
Does this Spark code raise an error when it concatenates two string columns passed by name?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat

spark = SparkSession.builder.getOrCreate()
data = [(1, "Hello", "World"), (2, "Spark", "Fun")]
df = spark.createDataFrame(data, ["id", "col1", "col2"])
df.select("id", concat("col1", "col2").alias("greeting")).show()
Attempts: 2 left
💡 Hint
Check which argument types the concat function accepts in Spark's Python API versus its Scala API.
✗ Incorrect
In PySpark, concat accepts both Column objects and plain column-name strings, so this snippet runs without error and produces 'HelloWorld' and 'SparkFun'. It is the Scala/Java API whose concat requires Column arguments (e.g. concat(col("col1"), col("col2"))); passing raw strings there fails at compile time, not with a runtime TypeError.
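A pure-Python sketch of what the PySpark call computes per row (no Spark session assumed), useful for checking the expected column values:

```python
# Row-wise analogue of concat("col1", "col2") from the snippet above.
rows = [(1, "Hello", "World"), (2, "Spark", "Fun")]
greetings = [(i, c1 + c2) for i, c1, c2 in rows]
assert greetings == [(1, "HelloWorld"), (2, "SparkFun")]
```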
🚀 Application
Advanced · 2:00 remaining
Using regexp_replace to clean data
You want to remove all digits from the 'info' column in a Spark DataFrame. Which code snippet correctly does this?
Attempts: 2 left
💡 Hint
Digits are represented by \d or [0-9] in regex, but consider the difference between \d and \D.
✗ Incorrect
The correct option uses '[0-9]+', which matches one or more digits and replaces them with an empty string, effectively removing all digits. One distractor uses '\d', which is a valid digit pattern but whose backslash must be escaped properly in a Python string literal (e.g. '\\d' or a raw string r'\d'). Another distractor removes letters, not digits, and the last uses '\D', which removes non-digits, the opposite of what is wanted.
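Spark's regexp_replace uses Java regex syntax, but the digit-stripping patterns discussed above behave the same in Python's re module, so they can be checked quickly on a sample string (the value "abc123def45" is an illustrative assumption, not from the question):

```python
import re

info = "abc123def45"  # hypothetical sample value for the 'info' column
assert re.sub(r"[0-9]+", "", info) == "abcdef"    # removes digits (the intended result)
assert re.sub(r"\d", "", info) == "abcdef"        # also removes digits, once escaped properly
assert re.sub(r"[a-zA-Z]+", "", info) == "12345"  # removes letters instead
assert re.sub(r"\D", "", info) == "12345"         # removes non-digits: the opposite
```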
🧠 Conceptual
Expert · 2:00 remaining
Understanding Spark string function behavior with nulls
What is the result of applying the Spark function upper() to a column containing null values?
Attempts: 2 left
💡 Hint
Think about how Spark functions handle null inputs generally.
✗ Incorrect
Spark string functions like upper() return null if the input is null. They do not raise errors or convert nulls to empty strings.
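This null-propagation rule can be modeled in plain Python; the helper below is a hypothetical sketch of how upper() behaves on a nullable column, not a Spark API:

```python
# Model of Spark's null semantics: upper(NULL) is NULL, never an error.
def spark_upper(value):
    return None if value is None else value.upper()

assert spark_upper("spark") == "SPARK"
assert spark_upper(None) is None  # null in, null out; no exception raised
```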