Challenge - 5 Problems
Schema Validation Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of schema validation with missing required field
Given the following Spark DataFrame schema and data, what will be the output when validating the data against the schema?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master('local').appName('SchemaValidation').getOrCreate()

schema = StructType([
    StructField('name', StringType(), nullable=False),
    StructField('age', IntegerType(), nullable=False)
])
data = [('Alice', 30), (None, 25)]
df = spark.createDataFrame(data, schema=schema)
df.show()
Attempts:
2 left
💡 Hint
Think about how Spark handles nullable fields when creating DataFrames.
✗ Incorrect
With the default verifySchema=True, PySpark validates each row eagerly during createDataFrame and raises a ValueError for the None in the non-nullable 'name' field, so show() is never reached. Nullability acts only as a hint (and both rows are displayed) when verification is disabled with verifySchema=False, or in the Scala/Java API, where non-nullable fields are not checked at creation time.
❓ Data Output
Intermediate · 2:00 remaining
Result of schema enforcement with cast
What is the output DataFrame when applying the schema with enforced types to the following data?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.master('local').appName('SchemaValidation').getOrCreate()

schema = StructType([
    StructField('id', IntegerType(), nullable=False)
])
data = [('1',), ('2',), ('three',)]
df = spark.createDataFrame(data, ['id'])
df_cast = df.selectExpr('cast(id as int) as id')
df_cast.show()
Attempts:
2 left
💡 Hint
Casting invalid strings to int results in null values in Spark.
✗ Incorrect
When casting 'three' to int with ANSI mode off (spark.sql.ansi.enabled=false, the default in Spark 3.x), Spark returns null instead of throwing an error; '1' and '2' are cast correctly to 1 and 2. With ANSI mode enabled (the default from Spark 4.0), the invalid cast raises a CAST_INVALID_INPUT error instead.
🔧 Debug
Advanced · 2:00 remaining
Identify the error in schema validation code
What error will this Spark code raise when trying to create a DataFrame with the given schema and data?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master('local').appName('SchemaValidation').getOrCreate()

schema = StructType([
    StructField('name', StringType(), nullable=False),
    StructField('age', IntegerType(), nullable=False)
])
data = [('Bob', 'thirty')]
df = spark.createDataFrame(data, schema=schema)
df.show()
Attempts:
2 left
💡 Hint
Consider how Spark handles invalid type values during DataFrame creation.
✗ Incorrect
With the default verifySchema=True, PySpark's createDataFrame does throw: row verification raises a TypeError because IntegerType cannot accept the string 'thirty', so show() never runs. Invalid values are silently converted to null only by an explicit cast, not during DataFrame creation in PySpark.
🧠 Conceptual
Advanced · 2:00 remaining
Understanding nullable field behavior in Spark schema
Which statement best describes how Spark handles nullable fields in a schema during DataFrame operations?
Attempts:
2 left
💡 Hint
Think about when Spark enforces schema constraints strictly.
✗ Incorrect
Spark SQL treats the nullable flag primarily as schema metadata: non-nullable constraints are not strictly enforced during DataFrame transformations (though PySpark's createDataFrame does verify input rows by default), and errors may only surface later, for example when writing to external storage that requires strict schema adherence.
🚀 Application
Expert · 3:00 remaining
Detecting schema mismatch in streaming data ingestion
You have a Spark Structured Streaming job reading JSON data with a predefined schema. If incoming JSON messages have missing required fields, what will happen during streaming processing?
Attempts:
2 left
💡 Hint
Consider how Spark handles schema enforcement in streaming ingestion.
✗ Incorrect
In Spark Structured Streaming, JSON fields that are missing from an incoming message but declared in the schema are parsed as null. The job keeps running unless you add explicit validation, such as filtering out or flagging records with null required fields.