Recall & Review
beginner
What is a DataFrame in Apache Spark?
A DataFrame is a distributed, table-like data structure in Spark that holds data in rows and columns, similar to a spreadsheet or SQL table. It allows easy data manipulation and analysis, and Spark can spread the work across many machines.
beginner
How do you read a CSV file into a Spark DataFrame?
Use spark.read.csv('file_path', header=True, inferSchema=True) to load a CSV file. 'header=True' means the first row is column names, and 'inferSchema=True' lets Spark guess data types.
intermediate
What is the difference between reading JSON and Parquet files in Spark?
JSON files are text-based and semi-structured, read with spark.read.json('file_path'). Parquet files are binary, columnar, and optimized for speed, read with spark.read.parquet('file_path').
intermediate
Why use Parquet files instead of CSV or JSON in Spark?
Parquet files are faster to read and use less storage because they store data in a compressed, column-based format, and they carry their own schema. Spark can also read only the columns a query needs, which makes Parquet better for big data processing.
beginner
What does the 'inferSchema' option do when reading CSV files in Spark?
'inferSchema=True' tells Spark to automatically detect the data type of each column instead of treating all columns as strings. Note that this requires an extra pass over the file, so it can slow down reads of large CSVs.
Which Spark method reads a JSON file into a DataFrame?
Use spark.read.json() to load JSON files into a DataFrame.
What option should you set to True to use the first row as column names when reading a CSV?
Setting header=True tells Spark the first row contains column names.
Which file format is columnar and optimized for fast reading in Spark?
Parquet is a columnar storage format designed for fast data processing.
What does 'inferSchema=True' do when reading CSV files?
It lets Spark guess the correct data types for each column.
Which method reads Parquet files in Spark?
spark.read.parquet() loads Parquet files into DataFrames.
Explain how to create a Spark DataFrame from a CSV file including important options.
Think about telling Spark where the file is and how to treat the first row and data types.
Describe the advantages of using Parquet files over CSV or JSON in Spark.
Focus on speed and storage benefits.