Challenge - 5 Problems
JSON Nested Data Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of reading nested JSON with Spark
What is the output of the following Spark code when reading a nested JSON file and selecting a nested field?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test').getOrCreate()

json_data = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30}},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''

rdd = spark.sparkContext.parallelize([json_data])
df = spark.read.json(rdd)
df.select('info.name').show()
Attempts: 2 left
💡 Hint
Remember how Spark handles nested fields in select statements.
✗ Incorrect
In Spark, dot notation works inside a plain string column reference: df.select('info.name') resolves the nested struct field and returns a single column named 'name'. The output is a two-row table containing Alice and Bob. col('info.name') and df['info.name'] are equivalent ways to reference the same nested field; backticks are only needed when a column name literally contains a dot.
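The nested lookup this question exercises can be mimicked in plain Python with the standard json module (an illustrative analogue of dot-notation field access, not the Spark API):

```python
import json

# The same nested records the quiz uses.
json_data = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30}},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''

records = json.loads(json_data)
# Selecting 'info.name' amounts to walking one level into each record.
names = [rec["info"]["name"] for rec in records]
print(names)  # ['Alice', 'Bob']
```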
❓ Data Output
Intermediate · 1:30 remaining
Count of nested JSON records after flattening
Given a DataFrame loaded from nested JSON, what is the count of rows after flattening the nested array field?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName('Test').getOrCreate()

json_data = '''[
  {"id": 1, "items": [{"name": "apple"}, {"name": "banana"}]},
  {"id": 2, "items": [{"name": "orange"}]}
]'''

rdd = spark.sparkContext.parallelize([json_data])
df = spark.read.json(rdd)
flat_df = df.select('id', explode('items').alias('item'))
count = flat_df.count()
Attempts: 2 left
💡 Hint
Count how many total items are in all nested arrays combined.
✗ Incorrect
The first record has 2 items and the second has 1, so explode produces one row per array element: 3 rows in total.
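The row-multiplying effect of explode() can be sketched in plain Python (an illustrative analogue, not the Spark API): each array element becomes its own output row, paired with the parent record's id.

```python
import json

json_data = '''[
  {"id": 1, "items": [{"name": "apple"}, {"name": "banana"}]},
  {"id": 2, "items": [{"name": "orange"}]}
]'''

records = json.loads(json_data)
# explode('items') emits one output row per element of the array,
# so the flattened row count equals the total number of items.
flat = [(rec["id"], item["name"]) for rec in records for item in rec["items"]]
print(len(flat))  # 3
print(flat)       # [(1, 'apple'), (1, 'banana'), (2, 'orange')]
```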
🔧 Debug
Advanced · 2:00 remaining
Identify the error when reading malformed nested JSON
What error will this Spark code produce when reading a malformed nested JSON string?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test').getOrCreate()

json_data = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''

rdd = spark.sparkContext.parallelize([json_data])
df = spark.read.json(rdd)
df.show()
Attempts: 2 left
💡 Hint
Check the JSON syntax carefully for missing commas or brackets.
✗ Incorrect
The first JSON object is missing its closing brace, so the string is malformed. Spark's JSON reader cannot parse it, and since the only column it can infer is the internal _corrupt_record column, df.show() fails with an AnalysisException.
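A quick way to locate the defect is to run the same string through Python's standard json parser, which reports exactly where parsing breaks (illustrative; Spark's parser behaves differently but trips over the same character):

```python
import json

# Same string as the quiz: the first object's closing brace is missing,
# so after "age": 30} the parser expects a key but finds '{'.
bad = '''[ {"id": 1, "info": {"name": "Alice", "age": 30}, {"id": 2, "info": {"name": "Bob", "age": 25}} ]'''

try:
    json.loads(bad)
    error = None
except json.JSONDecodeError as exc:
    error = exc.msg

print(error)  # e.g. "Expecting property name enclosed in double quotes"
```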
🚀 Application
Advanced · 2:00 remaining
Extract nested fields and create new columns
Which code snippet correctly extracts the nested 'age' field from 'info' and creates a new column 'age' in the DataFrame?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test').getOrCreate()

json_data = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30}},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''

rdd = spark.sparkContext.parallelize([json_data])
df = spark.read.json(rdd)
Attempts: 2 left
💡 Hint
Use dot notation to access nested fields in Spark DataFrame columns.
✗ Incorrect
In Spark, nested fields are accessed with dot notation, e.g. df.withColumn('age', col('info.age')) or equivalently df['info.age'] and df.info.age. The other options either raise errors or leave the value nested inside the struct.
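What withColumn('age', col('info.age')) does can be mirrored in plain Python (an illustrative analogue, not the Spark API): the nested value is promoted to a top-level key on each record.

```python
import json

json_data = '''[
  {"id": 1, "info": {"name": "Alice", "age": 30}},
  {"id": 2, "info": {"name": "Bob", "age": 25}}
]'''

records = json.loads(json_data)
# Promote the nested field: the struct stays intact, and a new
# top-level 'age' key carries a copy of the nested value.
for rec in records:
    rec["age"] = rec["info"]["age"]

print([rec["age"] for rec in records])  # [30, 25]
```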
🧠 Conceptual
Expert · 2:30 remaining
Understanding schema inference for nested JSON in Spark
When Spark reads a nested JSON file without a predefined schema, what is the behavior of schema inference for nested fields?
Attempts: 2 left
💡 Hint
Think about how Spark handles complex JSON structures by default.
✗ Incorrect
Spark's JSON reader automatically infers nested schemas recursively: JSON objects become StructType fields, JSON arrays become ArrayType columns, and element types are inferred all the way down the tree.
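A toy recursive type-describer in plain Python illustrates the idea of descending into structs and arrays (loosely mirroring the inference, not Spark's actual implementation, and simplified to inspect only the first array element):

```python
import json

def infer(value):
    """Toy recursive schema inference: dicts become nested mappings,
    lists are typed by their first element, scalars by type name."""
    if isinstance(value, dict):
        return {k: infer(v) for k, v in value.items()}
    if isinstance(value, list):
        return [infer(value[0])] if value else ["unknown"]
    return type(value).__name__

record = json.loads('{"id": 1, "info": {"name": "Alice", "age": 30}, "tags": ["a"]}')
print(infer(record))
# {'id': 'int', 'info': {'name': 'str', 'age': 'int'}, 'tags': ['str']}
```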