
Schema definition and inference in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output of schema inference with mixed data types
What is the output schema of the following Spark DataFrame code snippet?
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession.builder.getOrCreate()
data = [(1, 'Alice', 29), (2, 'Bob', 'thirty'), (3, 'Cathy', 25)]
df = spark.createDataFrame(data, ['id', 'name', 'age'])
df.printSchema()
A
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
B
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
C
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
D
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
💡 Hint
Spark infers the column type based on all values. Mixed types in a column cause Spark to choose a common type that can hold all values.
Data Output (intermediate)
Number of columns after explicit schema definition
Given the explicit schema below, how many columns will the resulting DataFrame have?
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
    StructField('city', StringType(), True)
])
data = [('Alice', 29, 'NY'), ('Bob', 35, 'LA')]
df = spark.createDataFrame(data, schema=schema)
df.columns
A. 3
B. 2
C. 4
D. 1
💡 Hint
Count the fields defined in the StructType schema.
🔧 Debug (advanced)
Identify the error in schema definition code
What error does the following code raise when creating a DataFrame with this schema?
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType
spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('value', IntegerType())
])
data = [(1, 100), (2, 200)]
df = spark.createDataFrame(data, schema=schema)
A. TypeError: StructField() missing 1 required positional argument: 'nullable'
B. ValueError: Data does not match schema
C. No error; the DataFrame is created successfully
D. SyntaxError: invalid syntax
💡 Hint
The StructField constructor uses nullable=True as its default value.
🧠 Conceptual (advanced)
Understanding schema inference behavior with null values
When Spark infers schema from JSON data containing null values in some fields, what is the expected behavior?
A. Spark infers the field as non-nullable regardless of null values
B. Spark treats null values as strings and infers the field as string type
C. Spark raises an error if null values are present during inference
D. Spark infers the data type ignoring nulls and marks the field as nullable
💡 Hint
Think about how Spark handles missing or null data during schema inference.
🚀 Application (expert)
Choosing schema definition for performance optimization
You have a large CSV dataset with millions of rows. You want to optimize Spark job performance by defining a schema instead of relying on inference. Which schema definition approach is best?
A. Define a StructType schema with exact data types matching the CSV columns and pass it to the reader via spark.read.schema
B. Use schema inference by reading a small sample of the CSV file
C. Skip schema definition and let Spark infer the schema automatically
D. Define all columns as StringType to avoid type mismatch errors
💡 Hint
An explicit schema lets Spark skip the extra pass over the data that inference requires.