Challenge - 5 Problems
File Format Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of reading CSV with header option
What will be the output schema of the DataFrame after running this code snippet in Apache Spark?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.read.csv('data.csv', header=True, inferSchema=True)
df.printSchema()
💡 Hint
Check the effect of header=True and inferSchema=True when reading CSV files.
✅ Explanation
When header=True, Spark uses the first row as column names. inferSchema=True lets Spark detect data types automatically. So columns get proper names and types.
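The same idea can be illustrated without Spark using Python's standard csv module: the first row supplies the column names, and type detection (which Spark's inferSchema=True does automatically) must be done by hand. The file content below is a hypothetical stand-in for data.csv.

```python
import csv
import io

# A small in-memory CSV whose first row is a header, standing in for data.csv.
raw = "name,age\nAlice,30\nBob,25\n"

# DictReader uses the first row as field names, like Spark's header=True.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["name"])  # Alice

# Type conversion is manual here; Spark's inferSchema=True would
# detect the age column as an integer type automatically.
ages = [int(r["age"]) for r in rows]
print(ages)  # [30, 25]
```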
❓ Data Output
Intermediate · 2:00 remaining
DataFrame content after reading JSON file
Given a JSON file with records [{"name":"Alice","age":30},{"name":"Bob","age":25}], what will be the output of df.show() after reading it with spark.read.json('people.json')?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.read.json('people.json')
df.show()
💡 Hint
Spark reads JSON and infers schema and types automatically.
✅ Explanation
Spark reads JSON records and infers the schema with correct types. Numbers remain integers, strings remain strings.
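Python's standard json module shows the same type preservation on the two records from the question (Spark reads JSON in a one-record-per-line layout, mirrored here):

```python
import json

# Two JSON-lines records, matching the people.json content in the question.
lines = '{"name":"Alice","age":30}\n{"name":"Bob","age":25}'
records = [json.loads(line) for line in lines.splitlines()]

# Numbers parse as int and strings as str, matching the types Spark infers.
print(type(records[0]["age"]).__name__)   # int
print(type(records[0]["name"]).__name__)  # str
print([r["name"] for r in records])       # ['Alice', 'Bob']
```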
❓ Visualization
Advanced · 1:30 remaining
Visualizing Parquet file schema
You load a Parquet file with spark.read.parquet('data.parquet'). Which Spark method will show a tree-like schema visualization of the DataFrame?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.read.parquet('data.parquet')
# Which method to call next?
💡 Hint
The correct method is printSchema(); names like showSchema() do not exist on a DataFrame.
✅ Explanation
printSchema() prints the schema in a tree format. Other options do not exist or cause errors.
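The exact columns depend on what data.parquet contains, but printSchema() output always has this tree shape, shown here for a hypothetical DataFrame with name and age columns:

```python
# Illustrative only: the tree layout printSchema() produces for a
# hypothetical DataFrame with a string column and a long column.
schema_tree = (
    "root\n"
    " |-- name: string (nullable = true)\n"
    " |-- age: long (nullable = true)\n"
)
print(schema_tree)
```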
🔧 Debug
Advanced · 2:00 remaining
Error when reading CSV without header option
What error or issue will occur if you run this code and the CSV file has a header row but you do NOT specify header=True?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.read.csv('data.csv')
df.show()
💡 Hint
By default, header=False, so the first row is treated as data.
✅ Explanation
Without header=True, Spark treats the first row as data and assigns default column names like _c0, _c1.
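A stdlib sketch of the same failure mode: without a header option, every row (including the header) is plain data, and column names fall back to positional defaults, mimicked here with Spark's _c0, _c1 naming scheme. The file content is a hypothetical stand-in for data.csv.

```python
import csv
import io

# CSV whose first row is really a header, standing in for data.csv.
raw = "name,age\nAlice,30\nBob,25\n"

# csv.reader treats every row as data, like Spark with header=False.
rows = list(csv.reader(io.StringIO(raw)))

# Positional default names, mimicking Spark's _c0, _c1, ...
columns = [f"_c{i}" for i in range(len(rows[0]))]
print(columns)  # ['_c0', '_c1']
print(rows[0])  # ['name', 'age'] -- the header row shows up as a data row
```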
🚀 Application
Expert · 3:00 remaining
Choosing file format for large dataset with nested data
You have a large dataset with nested JSON structures and want to store it efficiently for fast queries in Spark. Which file format should you choose?
💡 Hint
Think about file format efficiency and support for nested data.
✅ Explanation
Parquet is a columnar storage format that supports nested data and compression, making it efficient for large datasets and fast queries.