Recall & Review
beginner
What is a data format in the context of Apache Spark?
A data format is the way data is stored and organized on disk or in memory, such as CSV, JSON, Parquet, or ORC. It determines how much data Spark has to read, parse, and decode when reading and writing.
beginner
Why does using a columnar data format like Parquet improve performance in Spark?
Columnar formats store data by columns, so Spark can read only the needed columns, reducing disk I/O and speeding up queries.
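A minimal sketch of the idea, using only the Python standard library. This is a toy model, not how Parquet or Spark are implemented: the sample rows, the `row_store` and `column_store` layouts, and both helper functions are assumptions made up for illustration. It compares how many bytes a query on a single column has to touch in each layout.

```python
import json

# Hypothetical sample data, invented for this illustration.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "FR", "amount": 25.5},
    {"id": 3, "country": "DE", "amount": 7.25},
]

# Row-oriented layout: one serialized record per row (similar to JSON Lines).
row_store = [json.dumps(r) for r in rows]

# Column-oriented layout: one serialized chunk per column.
column_store = {name: json.dumps([r[name] for r in rows]) for name in rows[0]}


def sum_amount_row_layout(store):
    """Sum 'amount'; every full record must be read and parsed."""
    total, bytes_read = 0.0, 0
    for line in store:
        bytes_read += len(line)
        total += json.loads(line)["amount"]
    return total, bytes_read


def sum_amount_column_layout(store):
    """Sum 'amount'; only the 'amount' chunk is read."""
    chunk = store["amount"]
    return sum(json.loads(chunk)), len(chunk)


row_total, row_bytes = sum_amount_row_layout(row_store)
col_total, col_bytes = sum_amount_column_layout(column_store)
print(f"row layout:    total={row_total}, bytes read={row_bytes}")
print(f"column layout: total={col_total}, bytes read={col_bytes}")
```

Both layouts give the same answer, but the column layout reads only the chunk for the requested column. The gap grows with the number of columns the query does not need.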
intermediate
How does compression in data formats affect Spark performance?
Compression reduces file size, lowering disk and network usage. However, decompressing data costs CPU time, so the net effect depends on whether the job is limited by I/O or by CPU.
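The trade-off can be observed with a small standard-library sketch. It uses `zlib` as a stand-in codec, which is an assumption for illustration (this page does not say which codec a given Spark job uses), and the sample data is invented. It measures the size saved and the CPU time spent getting the bytes back.

```python
import time
import zlib

# Repetitive data, typical of a low-cardinality column (invented sample).
raw = ("DE,FR,DE,US,DE,FR\n" * 5000).encode("utf-8")

compressed = zlib.compress(raw)

# Decompression is the CPU cost paid on every read.
start = time.perf_counter()
restored = zlib.decompress(compressed)
elapsed = time.perf_counter() - start

ratio = len(compressed) / len(raw)
print(f"raw bytes:        {len(raw)}")
print(f"compressed bytes: {len(compressed)}")
print(f"size ratio:       {ratio:.4f}")
print(f"decompress time:  {elapsed * 1000:.3f} ms")
```

Fewer bytes cross the disk and the network, at the price of the measured decompression time. Which side wins depends on the data and the hardware, so the numbers here are only indicative.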
intermediate
What is predicate pushdown and how does it relate to data formats?
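Predicate pushdown passes filter conditions down to the data source, so data that cannot match is skipped during reading instead of being loaded and filtered afterwards. Formats like Parquet and ORC store statistics such as per-chunk minimum and maximum values that make this skipping possible, which reduces the data scanned and improves performance.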
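A toy model of statistics-based skipping, in plain Python. The `RowGroup` class, `make_group`, `scan_with_pushdown`, and the sample data are all hypothetical names invented for this sketch; real formats keep comparable min/max statistics in their file metadata, but the details differ.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A chunk of rows plus min/max statistics for the 'year' column."""
    min_year: int
    max_year: int
    rows: list  # list of (year, amount) tuples


def make_group(rows):
    years = [year for year, _ in rows]
    return RowGroup(min(years), max(years), rows)


# Invented sample data, split into three row groups.
groups = [
    make_group([(2019, 5.0), (2020, 7.0)]),
    make_group([(2021, 3.0), (2022, 9.0)]),
    make_group([(2023, 4.0), (2024, 6.0)]),
]


def scan_with_pushdown(groups, min_year):
    """Return rows with year >= min_year, skipping groups that cannot match."""
    matches, groups_read = [], 0
    for group in groups:
        if group.max_year < min_year:
            continue  # statistics prove no row qualifies: skip without reading
        groups_read += 1
        matches.extend(row for row in group.rows if row[0] >= min_year)
    return matches, groups_read


matches, groups_read = scan_with_pushdown(groups, 2023)
print(matches)      # [(2023, 4.0), (2024, 6.0)]
print(groups_read)  # 1
```

Only one of the three groups is read for the filter `year >= 2023`. In Spark itself, the physical plan printed by `DataFrame.explain()` lists the filters pushed to a Parquet scan under `PushedFilters`.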
beginner
Why might JSON or CSV formats be slower than Parquet in Spark?
JSON and CSV are row-based text formats. Spark has to read and parse every record in full, converting text into typed values, even when only some columns are needed, which makes them slower.
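The parsing cost can be seen with the standard-library `csv` module. This is a sketch on invented sample data, not Spark's CSV reader: to get a single column, every field of every line is still tokenized, and the wanted values arrive as text that must be converted.

```python
import csv
import io

# Invented sample file contents.
csv_text = "id,country,amount\n1,DE,10.0\n2,FR,25.5\n3,DE,7.25\n"

reader = csv.DictReader(io.StringIO(csv_text))

fields_parsed = 0
amounts = []
for record in reader:
    fields_parsed += len(record)             # every field is tokenized
    amounts.append(float(record["amount"]))  # text must be converted to a number

print(f"fields parsed: {fields_parsed}")
print(f"values needed: {len(amounts)}")
print(f"amounts: {amounts}")
```

Nine fields are parsed to obtain three values. A binary columnar format stores typed values per column, so this tokenizing and conversion work is largely avoided.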
Which data format is typically fastest for analytical queries in Spark?
Parquet is a columnar format optimized for analytical queries, which usually makes it faster than CSV, JSON, or plain text.
What feature allows Spark to read only necessary columns from a dataset?
Column pruning lets Spark read only the columns needed, reducing data read and improving speed.
How does compression affect Spark's data processing?
Compression reduces file size, saving disk and network time, but decompressing uses CPU, so it can speed a job up or slow it down depending on whether the job is limited by I/O or by CPU.
Which data format supports predicate pushdown in Spark?
Parquet supports predicate pushdown, allowing Spark to filter data early and improve performance.
Why is reading JSON slower than Parquet in Spark?
JSON is a text format that must be parsed record by record, and its row-based layout means Spark cannot skip the bytes of unneeded columns, so it reads all the data, making it slower.
Explain how data format choice affects Spark query performance.
Think about how Spark reads data and what features help reduce work.
Describe why Parquet is often preferred over CSV or JSON for big data analytics in Spark.
Focus on how Parquet saves time and resources during queries.