Challenge - 5 Problems
JSON Nested Data Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of reading nested JSON with Spark
What is the output of the following Spark code when reading a nested JSON file and selecting a nested field?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test').getOrCreate()

json_data = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30}},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''

rdd = spark.sparkContext.parallelize([json_data])
df = spark.read.json(rdd)
df.select('info.name').show()
Attempts: 2 left
💡 Hint
Remember how Spark handles nested fields in select statements.
✗ Incorrect
In Spark, dot notation works inside a plain string column reference: df.select('info.name') resolves the nested struct field and returns a single column named 'name'. The output is a two-row table containing Alice and Bob. col('info.name') and df['info.name'] are equivalent ways to reference the same nested field; backticks are only needed when a column name literally contains a dot.
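The nested lookup this question exercises can be mimicked in plain Python with the standard json module (an illustrative analogue of dot-notation field access, not the Spark API):

```python
import json

# The same nested records the quiz uses.
json_data = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30}},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''

records = json.loads(json_data)
# Selecting 'info.name' amounts to walking one level into each record.
names = [rec["info"]["name"] for rec in records]
print(names)  # ['Alice', 'Bob']
```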
❓ Data Output
Intermediate · 1:30 remaining
Count of nested JSON records after flattening
Given a DataFrame loaded from nested JSON, what is the count of rows after flattening the nested array field?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName('Test').getOrCreate()

json_data = '''[
  {"id": 1, "items": [{"name": "apple"}, {"name": "banana"}]},
  {"id": 2, "items": [{"name": "orange"}]}
]'''

rdd = spark.sparkContext.parallelize([json_data])
df = spark.read.json(rdd)
flat_df = df.select('id', explode('items').alias('item'))
count = flat_df.count()
Attempts: 2 left
💡 Hint
Count how many total items are in all nested arrays combined.
✗ Incorrect
The first record has 2 items and the second has 1, so explode produces one row per array element: 3 rows in total.
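The row-multiplying effect of explode() can be sketched in plain Python (an illustrative analogue, not the Spark API): each array element becomes its own output row, paired with the parent record's id.

```python
import json

json_data = '''[
  {"id": 1, "items": [{"name": "apple"}, {"name": "banana"}]},
  {"id": 2, "items": [{"name": "orange"}]}
]'''

records = json.loads(json_data)
# explode('items') emits one output row per element of the array,
# so the flattened row count equals the total number of items.
flat = [(rec["id"], item["name"]) for rec in records for item in rec["items"]]
print(len(flat))  # 3
print(flat)       # [(1, 'apple'), (1, 'banana'), (2, 'orange')]
```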
🔧 Debug
Advanced · 2:00 remaining
Identify the error when reading malformed nested JSON
What error will this Spark code produce when reading a malformed nested JSON string?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test').getOrCreate()

json_data = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''

rdd = spark.sparkContext.parallelize([json_data])
df = spark.read.json(rdd)
df.show()
Attempts: 2 left
💡 Hint
Check the JSON syntax carefully for missing commas or brackets.
✗ Incorrect
The first JSON object is missing its closing brace, so the string is malformed. Spark's JSON reader cannot parse it, and since the only column it can infer is the internal _corrupt_record column, df.show() fails with an AnalysisException.
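A quick way to locate the defect is to run the same string through Python's standard json parser, which reports exactly where parsing breaks (illustrative; Spark's parser behaves differently but trips over the same character):

```python
import json

# Same string as the quiz: the first object's closing brace is missing,
# so after "age": 30} the parser expects a key but finds '{'.
bad = '''[ {"id": 1, "info": {"name": "Alice", "age": 30}, {"id": 2, "info": {"name": "Bob", "age": 25}} ]'''

try:
    json.loads(bad)
    error = None
except json.JSONDecodeError as exc:
    error = exc.msg

print(error)  # e.g. "Expecting property name enclosed in double quotes"
```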
🚀 Application
Advanced · 2:00 remaining
Extract nested fields and create new columns
Which code snippet correctly extracts the nested 'age' field from 'info' and creates a new column 'age' in the DataFrame?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test').getOrCreate()

json_data = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30}},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''

rdd = spark.sparkContext.parallelize([json_data])
df = spark.read.json(rdd)
Attempts: 2 left
💡 Hint
Use dot notation to access nested fields in Spark DataFrame columns.
✗ Incorrect
In Spark, nested fields are accessed with dot notation, e.g. df.withColumn('age', col('info.age')) or equivalently df['info.age'] and df.info.age. The other options either raise errors or leave the value nested inside the struct.
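What withColumn('age', col('info.age')) does can be mirrored in plain Python (an illustrative analogue, not the Spark API): the nested value is promoted to a top-level key on each record.

```python
import json

json_data = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30}},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''

records = json.loads(json_data)
# Promote the nested field: the struct stays intact, and a new
# top-level 'age' key carries a copy of the nested value.
for rec in records:
    rec["age"] = rec["info"]["age"]

print([rec["age"] for rec in records])  # [30, 25]
```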
🧠 Conceptual
Expert · 2:30 remaining
Understanding schema inference for nested JSON in Spark
When Spark reads a nested JSON file without a predefined schema, what is the behavior of schema inference for nested fields?
Attempts: 2 left
💡 Hint
Think about how Spark handles complex JSON structures by default.
✗ Incorrect
Spark's JSON reader automatically infers nested schemas recursively: JSON objects become StructType fields, JSON arrays become ArrayType columns, and element types are inferred all the way down the tree.
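A toy recursive type-describer in plain Python illustrates the idea of descending into structs and arrays (loosely mirroring the inference, not Spark's actual implementation, and simplified to inspect only the first array element):

```python
import json

def infer(value):
    """Toy recursive schema inference: dicts become nested mappings,
    lists are typed by their first element, scalars by type name."""
    if isinstance(value, dict):
        return {k: infer(v) for k, v in value.items()}
    if isinstance(value, list):
        return [infer(value[0])] if value else ["unknown"]
    return type(value).__name__

record = json.loads('{"id": 1, "info": {"name": "Alice", "age": 30}, "tags": ["a"]}')
print(infer(record))
# {'id': 'int', 'info': {'name': 'str', 'age': 'int'}, 'tags': ['str']}
```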