Recall & Review
beginner
What is an RDD in Apache Spark?
RDD stands for Resilient Distributed Dataset. It is a basic data structure in Spark that represents a distributed collection of objects that can be processed in parallel.
beginner
How do you create an RDD from a collection in Spark?
You use the SparkContext's parallelize() method to create an RDD from an existing collection, such as a list or array.
beginner
How do you create an RDD from a file in Spark?
You use the SparkContext's textFile() method and provide the file path. This reads the file and creates an RDD in which each element is one line of the file.
intermediate
What is the difference between parallelize() and textFile() when creating RDDs?
parallelize() creates an RDD from an existing collection in memory, while textFile() creates an RDD by reading data from a file on disk or in distributed storage.
beginner
Why is creating RDDs from files useful in real life?
Because data often arrives in files such as logs, CSVs, or plain text, creating RDDs from files lets you process large datasets stored on disk or in cloud storage easily and in parallel.
Which Spark method creates an RDD from a list in memory?
parallelize() is used to create an RDD from an existing collection, such as a list.

What does textFile() return when called with a file path?
textFile() reads the file and creates an RDD with each line as an element.

Which of these is NOT a way to create an RDD?
fromCSV() is not a Spark method for creating RDDs; CSV files are usually read into DataFrames.

If you want to process data stored in a text file on disk, which method should you use?
textFile() reads data from a file and creates an RDD.

What type of data does parallelize() accept to create an RDD?
parallelize() takes an in-memory collection to create an RDD.

Explain how to create an RDD from a collection and from a file in Apache Spark.
Think about where your data lives: memory or disk.
Why would you choose to create an RDD from a file instead of a collection?
Consider data size and source.