Apache Spark · ~20 mins

Schema validation in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output of schema validation with missing required field
Given the following Spark DataFrame schema and data, what will be the output when validating the data against the schema?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master('local').appName('SchemaValidation').getOrCreate()

schema = StructType([
    StructField('name', StringType(), nullable=False),
    StructField('age', IntegerType(), nullable=False)
])

data = [('Alice', 30), (None, 25)]

df = spark.createDataFrame(data, schema=schema)
df.show()
A) Throws an error during DataFrame creation due to a null in a non-nullable field
B) Shows both rows; the second row has null for 'name' without error
C) Shows only the first row; the second row is dropped silently
D) Shows both rows; the null in 'name' is replaced by an empty string
💡 Hint
Think about how Spark handles nullable fields when creating DataFrames.
Data Output (intermediate)
Result of schema enforcement with cast
What is the output DataFrame when applying the schema with enforced types to the following data?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.master('local').appName('SchemaValidation').getOrCreate()

schema = StructType([
    StructField('id', IntegerType(), nullable=False)
])

data = [('1',), ('2',), ('three',)]

df = spark.createDataFrame(data, ['id'])
df_cast = df.selectExpr('cast(id as int) as id')
df_cast.show()
A)
+----+
|  id|
+----+
|null|
|null|
|null|
+----+
B) Throws a runtime error because 'three' is not castable to int
C)
+----+
|  id|
+----+
|   1|
|   2|
|   3|
+----+
D)
+----+
|  id|
+----+
|   1|
|   2|
|null|
+----+
💡 Hint
Casting invalid strings to int results in null values in Spark.
🔧 Debug (advanced)
Identify the error in schema validation code
What error will this Spark code raise when trying to create a DataFrame with the given schema and data?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master('local').appName('SchemaValidation').getOrCreate()

schema = StructType([
    StructField('name', StringType(), nullable=False),
    StructField('age', IntegerType(), nullable=False)
])

data = [('Bob', 'thirty')]

df = spark.createDataFrame(data, schema=schema)
df.show()
A) TypeError: Data type mismatch for field 'age'
B) RuntimeException: Failed to convert value 'thirty' to int
C) No error; shows 'Bob' and null for age
D) ValueError: Cannot cast string 'thirty' to IntegerType
💡 Hint
Consider how Spark handles invalid type values during DataFrame creation.
🧠 Conceptual (advanced)
Understanding nullable field behavior in Spark schema
Which statement best describes how Spark handles nullable fields in a schema during DataFrame operations?
A) Spark allows nulls in non-nullable fields during DataFrame creation but may fail during write operations
B) Spark enforces non-nullable fields strictly and throws errors if nulls appear at any time
C) Spark automatically replaces nulls in non-nullable fields with default values
D) Spark ignores nullable settings and treats all fields as nullable
💡 Hint
Think about when Spark enforces schema constraints strictly.
🚀 Application (expert)
Detecting schema mismatch in streaming data ingestion
You have a Spark Structured Streaming job reading JSON data with a predefined schema. If incoming JSON messages have missing required fields, what will happen during streaming processing?
A) Rows with missing required fields will have nulls in those fields without failing the job
B) The streaming job will fail immediately with a schema mismatch error
C) Spark will skip rows missing required fields silently without warning
D) Spark will automatically infer missing fields and fill them with default values
💡 Hint
Consider how Spark handles schema enforcement in streaming ingestion.