Challenge - 5 Problems
Schema Mastery Badge
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of schema inference with mixed data types
What is the output schema of the following Spark DataFrame code snippet?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.getOrCreate()
data = [(1, 'Alice', 29), (2, 'Bob', 'thirty'), (3, 'Cathy', 25)]
df = spark.createDataFrame(data, ['id', 'name', 'age'])
df.printSchema()
Attempts: 2 left
💡 Hint
Spark infers each column's type by merging the types of all sampled values. Consider what happens when two types cannot be merged into one.
✗ Incorrect
Because the 'age' column mixes integers and the string 'thirty', PySpark cannot merge LongType and StringType during inference, so createDataFrame raises a TypeError ("Can not merge type ...") and printSchema() is never reached. (Spark's file readers for CSV and JSON do fall back to string in this situation, but createDataFrame does not.)
❓ Predict Output
Intermediate · 1:30 remaining
Number of columns after explicit schema definition
Given the explicit schema below, how many columns will the resulting DataFrame have?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
    StructField('city', StringType(), True)
])
data = [('Alice', 29, 'NY'), ('Bob', 35, 'LA')]
df = spark.createDataFrame(data, schema=schema)
df.columns
Attempts: 2 left
💡 Hint
Count the fields defined in the StructType schema.
✗ Incorrect
The schema defines three fields: name, age, and city, so the DataFrame has 3 columns.
🔧 Debug
Advanced · 2:00 remaining
Identify the error in schema definition code
What error does the following code raise when creating a DataFrame with this schema?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('value', IntegerType())
])
data = [(1, 100), (2, 200)]
df = spark.createDataFrame(data, schema=schema)
Attempts: 2 left
💡 Hint
StructField constructor has nullable=True as the default value.
✗ Incorrect
The signature is StructField(name, dataType, nullable=True, metadata=None): nullable defaults to True, so omitting it is valid and the code runs without error.
🧠 Conceptual
Advanced · 1:30 remaining
Understanding schema inference behavior with null values
When Spark infers schema from JSON data containing null values in some fields, what is the expected behavior?
Attempts: 2 left
💡 Hint
Think about how Spark handles missing or null data during schema inference.
✗ Incorrect
Spark infers each field's type from its non-null values and marks the fields as nullable, so records with missing or null data remain valid.
🚀 Application
Expert · 2:30 remaining
Choosing schema definition for performance optimization
You have a large CSV dataset with millions of rows. You want to optimize Spark job performance by defining a schema instead of relying on inference. Which schema definition approach is best?
Attempts: 2 left
💡 Hint
Explicit schema helps Spark avoid scanning data multiple times.
✗ Incorrect
Defining the exact schema up front lets Spark skip the costly inference scan over the data, reducing overhead and speeding up reads of large datasets.