Recall & Review
beginner
What is an RDD in Apache Spark?
RDD stands for Resilient Distributed Dataset. It is a basic data structure in Spark that represents a distributed collection of objects that can be processed in parallel.
beginner
How do you create an RDD from a collection in Spark?
You use the SparkContext's parallelize() method to create an RDD from an existing collection, such as a list or array.
beginner
How do you create an RDD from a file in Spark?
You use the SparkContext's textFile() method and provide the file path. This reads the file and creates an RDD in which each element is one line of the file.
intermediate
What is the difference between parallelize() and textFile() when creating RDDs?
parallelize() creates an RDD from an existing collection in memory, while textFile() creates an RDD by reading data from a file on disk or in distributed storage.
beginner
Why is creating RDDs from files useful in real life?
Because data often arrives in files such as logs, CSVs, or plain text, creating RDDs from files lets you process large datasets stored on disk or in cloud storage easily and in parallel.
Which Spark method creates an RDD from a list in memory?
parallelize() is used to create an RDD from an existing collection, such as a list.

What does textFile() return when called with a file path?
textFile() reads the file and creates an RDD with each line as an element.

Which of these is NOT a way to create an RDD?
fromCSV() is not a Spark method for creating RDDs; CSV files are usually read into DataFrames.

If you want to process data stored in a text file on disk, which method should you use?
textFile() reads data from a file and creates an RDD.

What type of data does parallelize() accept to create an RDD?
parallelize() takes an in-memory collection to create an RDD.

Explain how to create an RDD from a collection and from a file in Apache Spark.
Think about where your data lives: memory or disk.
Why would you choose to create an RDD from a file instead of a collection?
Consider data size and source.