Recall & Review
beginner
What is a DataFrame in Apache Spark?
A DataFrame is a distributed, table-like data structure in Spark that holds data in rows and columns, similar to a spreadsheet or SQL table. It allows easy data manipulation and analysis, and Spark can spread the work across many machines.
beginner
How do you read a CSV file into a Spark DataFrame?
Use spark.read.csv('file_path', header=True, inferSchema=True) to load a CSV file. 'header=True' means the first row is column names, and 'inferSchema=True' lets Spark guess data types.
intermediate
What is the difference between reading JSON and Parquet files in Spark?
JSON files are text-based and semi-structured, read with spark.read.json('file_path'). Parquet files are binary, columnar, and optimized for speed, read with spark.read.parquet('file_path').
intermediate
Why use Parquet files instead of CSV or JSON in Spark?
Parquet files are faster to read and use less storage because they store data in a compressed, column-based format, and they carry their own schema. Spark can also read only the columns a query needs, which makes Parquet better for big data processing.
beginner
What does the 'inferSchema' option do when reading CSV files in Spark?
'inferSchema=True' tells Spark to automatically detect the data type of each column instead of treating all columns as strings. Note that this requires an extra pass over the file, so it can slow down reads of large CSVs.
Which Spark method reads a JSON file into a DataFrame?
Use spark.read.json() to load JSON files into a DataFrame.
What option should you set to True to use the first row as column names when reading a CSV?
Setting header=True tells Spark the first row contains column names.
Which file format is columnar and optimized for fast reading in Spark?
Parquet is a columnar storage format designed for fast data processing.
What does 'inferSchema=True' do when reading CSV files?
It lets Spark guess the correct data types for each column.
Which method reads Parquet files in Spark?
spark.read.parquet() loads Parquet files into DataFrames.
Explain how to create a Spark DataFrame from a CSV file including important options.
Think about telling Spark where the file is and how to treat the first row and data types.
Describe the advantages of using Parquet files over CSV or JSON in Spark.
Focus on speed and storage benefits.