Apache Spark · ~20 mins

Schema validation in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output of schema validation with missing required field
Given the following Spark DataFrame schema and data, what will be the output when validating the data against the schema?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master('local').appName('SchemaValidation').getOrCreate()

schema = StructType([
    StructField('name', StringType(), nullable=False),
    StructField('age', IntegerType(), nullable=False)
])

data = [('Alice', 30), (None, 25)]

df = spark.createDataFrame(data, schema=schema)
df.show()
A) Throws an error during DataFrame creation due to a null in a non-nullable field
B) Shows both rows; the second row has null for 'name' without error
C) Shows only the first row; the second row is dropped silently
D) Shows both rows; the null in 'name' is replaced by an empty string
💡 Hint
Think about how Spark handles nullable fields when creating DataFrames.
Data Output (intermediate)
Result of schema enforcement with cast
What is the output DataFrame when applying the schema with enforced types to the following data?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.master('local').appName('SchemaValidation').getOrCreate()

schema = StructType([
    StructField('id', IntegerType(), nullable=False)
])

data = [('1',), ('2',), ('three',)]

df = spark.createDataFrame(data, ['id'])
df_cast = df.selectExpr('cast(id as int) as id')
df_cast.show()
A)
+----+
|  id|
+----+
|null|
|null|
|null|
+----+
B) Throws a runtime error because 'three' is not castable to int
C)
+----+
|  id|
+----+
|   1|
|   2|
|   3|
+----+
D)
+----+
|  id|
+----+
|   1|
|   2|
|null|
+----+
💡 Hint
Casting invalid strings to int results in null values in Spark.
🔧 Debug (advanced)
Identify the error in schema validation code
What error will this Spark code raise when trying to create a DataFrame with the given schema and data?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master('local').appName('SchemaValidation').getOrCreate()

schema = StructType([
    StructField('name', StringType(), nullable=False),
    StructField('age', IntegerType(), nullable=False)
])

data = [('Bob', 'thirty')]

df = spark.createDataFrame(data, schema=schema)
df.show()
A) TypeError: Data type mismatch for field 'age'
B) RuntimeException: Failed to convert value 'thirty' to int
C) No error; shows 'Bob' and null for age
D) ValueError: Cannot cast string 'thirty' to IntegerType
💡 Hint
Consider how Spark handles invalid type values during DataFrame creation.
🧠 Conceptual (advanced)
Understanding nullable field behavior in Spark schema
Which statement best describes how Spark handles nullable fields in a schema during DataFrame operations?
A) Spark allows nulls in non-nullable fields during DataFrame creation but may fail during write operations
B) Spark enforces non-nullable fields strictly and throws errors if nulls appear at any time
C) Spark automatically replaces nulls in non-nullable fields with default values
D) Spark ignores nullable settings and treats all fields as nullable
💡 Hint
Think about when Spark enforces schema constraints strictly.
🚀 Application (expert)
Detecting schema mismatch in streaming data ingestion
You have a Spark Structured Streaming job reading JSON data with a predefined schema. If incoming JSON messages have missing required fields, what will happen during streaming processing?
A) Rows with missing required fields will have nulls in those fields without failing the job
B) The streaming job will fail immediately with a schema mismatch error
C) Spark will skip rows missing required fields silently without warning
D) Spark will automatically infer missing fields and fill them with default values
💡 Hint
Consider how Spark handles schema enforcement in streaming ingestion.