
Creating DataFrames from files (CSV, JSON, Parquet) in Apache Spark - Practice Exercises

Challenge - 5 Problems
Problem 1: Predict Output (intermediate)
Output of reading CSV with header option
What will be the output schema of the DataFrame after running this code snippet in Apache Spark?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.read.csv('data.csv', header=True, inferSchema=True)
df.printSchema()
A
root
 |-- column1: string (nullable = true)
 |-- column2: integer (nullable = true)
B
root
 |-- _c0: integer (nullable = true)
 |-- _c1: integer (nullable = true)
C
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
D
root
 |-- column1: integer (nullable = true)
 |-- column2: string (nullable = true)
💡 Hint
Check the effect of header=True and inferSchema=True when reading CSV files.
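To see why the hint points at those two options, here is a minimal pure-Python sketch of the kind of per-column type inference that inferSchema=True performs. This is illustrative only: Spark's real inference covers more types (longs, timestamps, null tokens), and infer_type is a hypothetical helper, not a Spark API.

```python
def infer_type(values):
    """Return the narrowest type that fits every value in a CSV column,
    trying integer, then double, then falling back to string."""
    for type_name, caster in (("integer", int), ("double", float)):
        try:
            for v in values:
                caster(v)
            return type_name
        except ValueError:
            continue
    return "string"

# With header=True the first CSV row supplies the column names, and only
# the remaining rows feed type inference.
print(infer_type(["Alice", "Bob"]))  # string
print(infer_type(["30", "25"]))      # integer
```

A single non-numeric value anywhere in the column is enough to demote it to string, which is why mixed columns come out as string in Spark too.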
Problem 2: Data Output (intermediate)
DataFrame content after reading JSON file
Given a JSON file with records [{"name":"Alice","age":30},{"name":"Bob","age":25}], what will be the output of df.show() after reading it with spark.read.json('people.json')?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.read.json('people.json')
df.show()
A
+-----+-----+
| name|  age|
+-----+-----+
|   30|Alice|
|   25|  Bob|
+-----+-----+
B
+-----+---+
| name|age|
+-----+---+
|Alice| 30|
|  Bob| 25|
+-----+---+
C
+-----+----+
| name| age|
+-----+----+
|Alice|"30"|
|  Bob|"25"|
+-----+----+
D
+-----+----+
| name| age|
+-----+----+
|Alice|null|
|  Bob|null|
+-----+----+
💡 Hint
Spark reads JSON and infers schema and types automatically.
Problem 3: Visualization (advanced)
Visualizing Parquet file schema
You load a Parquet file with spark.read.parquet('data.parquet'). Which Spark method will show a tree-like schema visualization of the DataFrame?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.read.parquet('data.parquet')
# Which method to call next?
A
df.schemaTree()
B
df.showSchema()
C
df.printSchema()
D
df.displaySchema()
💡 Hint
Only one of these four methods actually exists in the DataFrame API.
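The tree layout that the correct method produces is easy to recognize. Here is a hypothetical pure-Python sketch (print_schema is not a Spark API) that mimics the output format for a list of (name, type) fields:

```python
def print_schema(fields):
    """Render fields the way Spark's schema printer does:
    'root' followed by ' |-- name: type (nullable = true)' lines."""
    lines = ["root"]
    for name, dtype in fields:
        lines.append(f" |-- {name}: {dtype} (nullable = true)")
    return "\n".join(lines)

print(print_schema([("name", "string"), ("age", "long")]))
```

On a real DataFrame the equivalent call prints directly to stdout rather than returning a string.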
Problem 4: Debug (advanced)
Error when reading CSV without header option
What error or issue will occur if you run this code when the CSV file has a header row but you do NOT specify header=True?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.read.csv('data.csv')
df.show()
A
The DataFrame will be empty because the header is not read.
B
Spark will raise a FileNotFoundError because the header is missing.
C
Spark will raise a ValueError about the missing header option.
D
The first row will be treated as data, not header, so column names will be _c0, _c1, etc.
💡 Hint
By default header=False, so the first row is treated as data.
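The default naming rule itself is simple enough to sketch in pure Python. This illustrates the convention only (default_column_names is a hypothetical helper, not Spark code): with header=False every row, including the first, is data, and columns are named positionally.

```python
def default_column_names(num_columns):
    """Spark-style default CSV column names: _c0, _c1, ..."""
    return [f"_c{i}" for i in range(num_columns)]

rows = [["name", "age"], ["Alice", "30"]]  # the header row is kept as data
print(default_column_names(len(rows[0])))  # ['_c0', '_c1']
```

So nothing fails at read time; the symptom is a DataFrame whose first "data" row is really the header and whose columns are named _c0, _c1.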
Problem 5: Application (expert)
Choosing file format for large dataset with nested data
You have a large dataset with nested JSON structures and want to store it efficiently for fast queries in Spark. Which file format should you choose?
A
Parquet, because it supports nested data and is columnar for fast queries
B
CSV, because it is simple and widely supported
C
JSON, because it preserves nested structures naturally
D
TXT, because it is easy to read and write
💡 Hint
Think about file format efficiency and support for nested data.