
Creating RDDs from collections and files in Apache Spark - Quick Revision & Summary

Recall & Review
beginner
What is an RDD in Apache Spark?
RDD stands for Resilient Distributed Dataset. It is Spark's fundamental data structure: an immutable, fault-tolerant, distributed collection of objects that can be processed in parallel across a cluster.
beginner
How do you create an RDD from a collection in Spark?
You use the SparkContext's parallelize() method to create an RDD from an existing collection like a list or array.
beginner
How do you create an RDD from a file in Spark?
You use the SparkContext's textFile() method and provide the file path. This reads the file and creates an RDD where each element is a line from the file.
intermediate
What is the difference between parallelize() and textFile() when creating RDDs?
parallelize() creates an RDD from an existing collection in memory, while textFile() creates an RDD by reading data from a file stored on disk or distributed storage.
beginner
Why is creating RDDs from files useful in real life?
Because data often comes from files like logs, CSVs, or text files, creating RDDs from files lets you process large datasets stored on disk or cloud storage easily and in parallel.
Which Spark method creates an RDD from a list in memory?
A. textFile()
B. parallelize()
C. read()
D. collect()
What does textFile() return when called with a file path?
A. A DataFrame
B. A list of file names
C. An RDD where each element is a line from the file
D. A single string with the whole file content
Which of these is NOT a way to create an RDD?
A. fromCSV()
B. textFile()
C. parallelize()
D. wholeTextFiles()
If you want to process data stored in a text file on disk, which method should you use?
A. parallelize()
B. map()
C. collect()
D. textFile()
What type of data does parallelize() accept to create an RDD?
A. A collection like a list or array
B. A file path
C. A database connection
D. A SparkSession
Explain how to create an RDD from a collection and from a file in Apache Spark.
Think about where your data lives: memory or disk.
Why would you choose to create an RDD from a file instead of a collection?
Think about data size and source.