Challenge - 5 Problems
File Format Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of reading CSV with header option
What will be the output schema of the DataFrame after running this code snippet in Apache Spark?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.read.csv('data.csv', header=True, inferSchema=True)
df.printSchema()
💡 Hint
Check the effect of header=True and inferSchema=True when reading CSV files.
✅ Explanation
When header=True, Spark uses the first row as column names. inferSchema=True lets Spark detect data types automatically. So columns get proper names and types.
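The same idea can be illustrated without Spark using Python's standard csv module: the first row supplies the column names, and type detection (which Spark's inferSchema=True does automatically) must be done by hand. The file content below is a hypothetical stand-in for data.csv.

```python
import csv
import io

# A small in-memory CSV whose first row is a header, standing in for data.csv.
raw = "name,age\nAlice,30\nBob,25\n"

# DictReader uses the first row as field names, like Spark's header=True.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["name"])  # Alice

# Type conversion is manual here; Spark's inferSchema=True would
# detect the age column as an integer type automatically.
ages = [int(r["age"]) for r in rows]
print(ages)  # [30, 25]
```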
❓ Data Output
Intermediate · 2:00 remaining
DataFrame content after reading JSON file
Given a JSON file with records [{"name":"Alice","age":30},{"name":"Bob","age":25}], what will be the output of df.show() after reading it with spark.read.json('people.json')?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.read.json('people.json')
df.show()
💡 Hint
Spark reads JSON and infers schema and types automatically.
✅ Explanation
Spark reads JSON records and infers the schema with correct types. Numbers remain integers, strings remain strings.
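Python's standard json module shows the same type preservation on the two records from the question (Spark reads JSON in a one-record-per-line layout, mirrored here):

```python
import json

# Two JSON-lines records, matching the people.json content in the question.
lines = '{"name":"Alice","age":30}\n{"name":"Bob","age":25}'
records = [json.loads(line) for line in lines.splitlines()]

# Numbers parse as int and strings as str, matching the types Spark infers.
print(type(records[0]["age"]).__name__)   # int
print(type(records[0]["name"]).__name__)  # str
print([r["name"] for r in records])       # ['Alice', 'Bob']
```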
❓ Visualization
Advanced · 1:30 remaining
Visualizing Parquet file schema
You load a Parquet file with spark.read.parquet('data.parquet'). Which Spark method will show a tree-like schema visualization of the DataFrame?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.read.parquet('data.parquet')
# Which method to call next?
💡 Hint
The correct method is printSchema(); names like showSchema() do not exist on a DataFrame.
✅ Explanation
printSchema() prints the schema in a tree format. Other options do not exist or cause errors.
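The exact columns depend on what data.parquet contains, but printSchema() output always has this tree shape, shown here for a hypothetical DataFrame with name and age columns:

```python
# Illustrative only: the tree layout printSchema() produces for a
# hypothetical DataFrame with a string column and a long column.
schema_tree = (
    "root\n"
    " |-- name: string (nullable = true)\n"
    " |-- age: long (nullable = true)\n"
)
print(schema_tree)
```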
🔧 Debug
Advanced · 2:00 remaining
Error when reading CSV without header option
What error or issue will occur if you run this code and the CSV file has a header row but you do NOT specify header=True?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.read.csv('data.csv')
df.show()
💡 Hint
By default, header=False, so the first row is treated as data.
✅ Explanation
Without header=True, Spark treats the first row as data and assigns default column names like _c0, _c1.
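A stdlib sketch of the same failure mode: without a header option, every row (including the header) is plain data, and column names fall back to positional defaults, mimicked here with Spark's _c0, _c1 naming scheme. The file content is a hypothetical stand-in for data.csv.

```python
import csv
import io

# CSV whose first row is really a header, standing in for data.csv.
raw = "name,age\nAlice,30\nBob,25\n"

# csv.reader treats every row as data, like Spark with header=False.
rows = list(csv.reader(io.StringIO(raw)))

# Positional default names, mimicking Spark's _c0, _c1, ...
columns = [f"_c{i}" for i in range(len(rows[0]))]
print(columns)  # ['_c0', '_c1']
print(rows[0])  # ['name', 'age'] -- the header row shows up as a data row
```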
🚀 Application
Expert · 3:00 remaining
Choosing file format for large dataset with nested data
You have a large dataset with nested JSON structures and want to store it efficiently for fast queries in Spark. Which file format should you choose?
💡 Hint
Think about file format efficiency and support for nested data.
✅ Explanation
Parquet is a columnar storage format that supports nested data and compression, making it efficient for large datasets and fast queries.