Apache Spark · Data Formats · ~5 mins

Why data format affects performance in Apache Spark - Quick Recap

Recall & Review
beginner
What is a data format in the context of Apache Spark?
A data format is the way data is stored and organized on disk or in memory, such as CSV, JSON, Parquet, or ORC. It affects how Spark reads and writes data.
beginner
Why does using a columnar data format like Parquet improve performance in Spark?
Columnar formats store data by columns, so Spark can read only the needed columns, reducing disk I/O and speeding up queries.
intermediate
How does compression in data formats affect Spark performance?
Compression reduces file size, lowering disk and network usage. However, decompressing data uses CPU, so the right balance improves overall speed.
intermediate
What is predicate pushdown and how does it relate to data formats?
Predicate pushdown lets Spark filter data early during reading, reducing data scanned. Formats like Parquet and ORC support this, improving performance.
beginner
Why might JSON or CSV formats be slower than Parquet in Spark?
JSON and CSV are row-based and text formats, requiring more parsing and reading all data even if only some columns are needed, making them slower.
Which data format is typically fastest for analytical queries in Spark?
A. JSON
B. CSV
C. Parquet
D. TXT
What feature allows Spark to read only necessary columns from a dataset?
A. Predicate pushdown
B. Column pruning
C. Compression
D. Partitioning
How does compression affect Spark's data processing?
A. Reduces file size but may add CPU overhead
B. Always slows down processing
C. Increases file size
D. Has no effect
Which data format supports predicate pushdown in Spark?
A. Parquet
B. CSV
C. JSON
D. TXT
Why is reading JSON slower than Parquet in Spark?
A. JSON is a binary format
B. JSON files are always compressed
C. JSON supports column pruning
D. JSON requires parsing text and reading all data
Explain how data format choice affects Spark query performance.
Think about how Spark reads data and what features help reduce work.
Describe why Parquet is often preferred over CSV or JSON for big data analytics in Spark.
Focus on how Parquet saves time and resources during queries.