
Reading JSON and nested data in Apache Spark - Practice Problems & Coding Challenges

Challenge: 5 Problems
Problem 1: Predict Output (intermediate)
Output of reading nested JSON with Spark
What is the output of the following Spark code when reading a nested JSON file and selecting a nested field?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test').getOrCreate()

json_data = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30}},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''

rdd = spark.sparkContext.parallelize([json_data])
df = spark.read.json(rdd)
df.select('info.name').show()
A. AnalysisException: cannot resolve 'info.name' given input columns: [id, info]
B.
+----------+
|info.name |
+----------+
|Alice     |
|Bob       |
+----------+
C.
+----------+
|info.name |
+----------+
|null      |
|null      |
+----------+
D.
+----------+
|name      |
+----------+
|Alice     |
|Bob       |
+----------+
💡 Hint
Remember how Spark handles nested fields in select statements.
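If you can't run Spark locally, a pure-Python sketch of how a dotted path such as 'info.name' resolves against each record can build the same intuition. This is an analogue of Spark's nested-field resolution, not Spark itself, and the helper name `resolve_path` is invented for illustration:

```python
import json

def resolve_path(record, dotted_path):
    """Walk a dotted path (e.g. 'info.name') through nested dicts."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return value

json_data = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30}},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''

rows = json.loads(json_data)
names = [resolve_path(r, "info.name") for r in rows]
print(names)  # ['Alice', 'Bob']
```

Spark resolves `select('info.name')` in the same top-down fashion; the question is what the resulting column ends up being called.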
Problem 2: Data Output (intermediate)
Count of nested JSON records after flattening
Given a DataFrame loaded from nested JSON, what is the count of rows after flattening the nested array field?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName('Test').getOrCreate()

json_data = '''[
  {"id": 1, "items": [{"name": "apple"}, {"name": "banana"}]},
  {"id": 2, "items": [{"name": "orange"}]}
]'''
rdd = spark.sparkContext.parallelize([json_data])
df = spark.read.json(rdd)

flat_df = df.select('id', explode('items').alias('item'))
count = flat_df.count()
A. 3
B. 1
C. 2
D. 0
💡 Hint
Count how many total items are in all nested arrays combined.
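The counting logic behind `explode` can be rehearsed in plain Python (this is only an analogue of the Spark operator, using different sample data so it doesn't spoil the answer above):

```python
# explode('items') emits one output row per (parent row, array element)
# pair, so the flattened row count is the sum of the array lengths
sample = [
    {"id": 1, "tags": ["a", "b", "c"]},
    {"id": 2, "tags": ["d"]},
]
exploded = [(r["id"], tag) for r in sample for tag in r["tags"]]
print(len(exploded))  # 4
```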
Problem 3: 🔧 Debug (advanced)
Identify the error when reading malformed nested JSON
What error will this Spark code produce when reading a malformed nested JSON string?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test').getOrCreate()

json_data = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''

rdd = spark.sparkContext.parallelize([json_data])
df = spark.read.json(rdd)
df.show()
A. ValueError: Expecting ',' delimiter
B. AnalysisException: Malformed records are detected in record parsing
C. No error, DataFrame shows both records correctly
D. TypeError: 'str' object is not iterable
💡 Hint
Check the JSON syntax carefully for missing commas or brackets.
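A quick way to confirm the string really is malformed is to feed it to Python's stdlib `json` module, which rejects invalid input outright. Note this only validates the syntax; how Spark itself reacts depends on its JSON parse mode, which is what the question is probing:

```python
import json

bad = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''

try:
    json.loads(bad)
    result = "valid JSON"
except json.JSONDecodeError as exc:
    # the first record is missing its closing brace, so parsing fails
    result = f"invalid JSON: {exc.msg}"
print(result)
```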
Problem 4: 🚀 Application (advanced)
Extract nested fields and create new columns
Which code snippet correctly extracts the nested 'age' field from 'info' and creates a new column 'age' in the DataFrame?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test').getOrCreate()

json_data = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30}},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''
rdd = spark.sparkContext.parallelize([json_data])
df = spark.read.json(rdd)
A. df.withColumn('age', df['info.age']).show()
B. df.withColumn('age', df.info['age']).show()
C. df.withColumn('age', df['info']['age']).show()
D. df.withColumn('age', df.info.age).show()
💡 Hint
Use dot notation to access nested fields in Spark DataFrame columns.
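The underlying idea, promoting a nested field to a top-level column, looks like this on plain Python dicts (an analogue only; Spark's `withColumn` operates on Column expressions, not dicts):

```python
import json

json_data = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30}},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''

rows = json.loads(json_data)
# copy the nested 'age' value out of 'info' into a top-level key
with_age = [{**r, "age": r["info"]["age"]} for r in rows]
print([r["age"] for r in with_age])  # [30, 25]
```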
Problem 5: 🧠 Conceptual (expert)
Understanding schema inference for nested JSON in Spark
When Spark reads a nested JSON file without a predefined schema, what is the behavior of schema inference for nested fields?
A. Spark requires a user-defined schema to read nested JSON; otherwise, it raises an error.
B. Spark infers only the top-level fields and treats nested fields as strings.
C. Spark infers the schema recursively, including all nested fields and their types automatically.
D. Spark infers the schema but ignores nested arrays and maps, flattening them into strings.
💡 Hint
Think about how Spark handles complex JSON structures by default.
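A toy recursive type-inference pass over one parsed record illustrates what any schema-inferring reader has to do: walk the structure and describe each level. This is a deliberate simplification (the function name `infer_type` is invented, and real readers sample many records and merge the results):

```python
import json

def infer_type(value):
    """Recursively describe the type of a parsed JSON value."""
    if isinstance(value, dict):
        # structs: infer each field independently
        return {k: infer_type(v) for k, v in value.items()}
    if isinstance(value, list):
        # assume homogeneous arrays; infer the element type from the head
        return [infer_type(value[0])] if value else ["unknown"]
    return type(value).__name__

record = json.loads('{"id": 1, "info": {"name": "Alice", "age": 30}, "tags": ["x"]}')
print(infer_type(record))
# {'id': 'int', 'info': {'name': 'str', 'age': 'int'}, 'tags': ['str']}
```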