
Schema definition and inference in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output of schema inference with mixed data types
What is the output schema of the following Spark DataFrame code snippet?
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession.builder.getOrCreate()
data = [(1, 'Alice', 29), (2, 'Bob', 'thirty'), (3, 'Cathy', 25)]
df = spark.createDataFrame(data, ['id', 'name', 'age'])
df.printSchema()
A
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
B
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
C
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
D
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
💡 Hint
Spark infers the column type based on all values. Mixed types in a column cause Spark to choose a common type that can hold all values.
Data Output (intermediate)
Number of columns after explicit schema definition
Given the explicit schema below, how many columns will the resulting DataFrame have?
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
    StructField('city', StringType(), True)
])
data = [('Alice', 29, 'NY'), ('Bob', 35, 'LA')]
df = spark.createDataFrame(data, schema=schema)
df.columns
A. 3
B. 2
C. 4
D. 1
💡 Hint
Count the fields defined in the StructType schema.
🔧 Debug (advanced)
Identify the error in schema definition code
What error does the following code raise when creating a DataFrame with this schema?
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType
spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('value', IntegerType())
])
data = [(1, 100), (2, 200)]
df = spark.createDataFrame(data, schema=schema)
A. TypeError: StructField() missing 1 required positional argument: 'nullable'
B. ValueError: Data does not match schema
C. No error; the DataFrame is created successfully
D. SyntaxError: invalid syntax
💡 Hint
The StructField constructor uses nullable=True as its default value.
🧠 Conceptual (advanced)
Understanding schema inference behavior with null values
When Spark infers schema from JSON data containing null values in some fields, what is the expected behavior?
A. Spark infers the field as non-nullable regardless of null values
B. Spark treats null values as strings and infers the field as string type
C. Spark raises an error if null values are present during inference
D. Spark infers the data type ignoring nulls and marks the field as nullable
💡 Hint
Think about how Spark handles missing or null data during schema inference.
🚀 Application (expert)
Choosing schema definition for performance optimization
You have a large CSV dataset with millions of rows. You want to optimize Spark job performance by defining a schema instead of relying on inference. Which schema definition approach is best?
A. Define a StructType schema with exact data types matching the CSV columns and pass it to the reader via spark.read.schema
B. Use schema inference by reading a small sample of the CSV file
C. Skip schema definition and let Spark infer the schema automatically
D. Define all columns as StringType to avoid type mismatch errors
💡 Hint
An explicit schema lets Spark skip the extra pass over the data that inference requires.