beginner

What is a data format in the context of Apache Spark?

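A data format defines how data is encoded and laid out in storage. Common examples are CSV, JSON, Parquet, and ORC. The format determines how much data Spark must read and parse, and which optimizations, such as column pruning and predicate pushdown, are available.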

beginner

Why does using a columnar data format like Parquet improve performance in Spark?

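Columnar formats store the values of each column together, so Spark can read only the columns a query references (column pruning). This reduces disk I/O and speeds up queries, and similar values stored side by side also tend to compress well.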
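
The idea can be sketched with a toy model in plain Python. This is not how Spark or Parquet is implemented; the `rows` data and the `values_touched_*` counters are hypothetical and exist only to show how many values each layout has to read to sum one column.

```python
# Toy model of row vs. columnar layout (not Spark's or Parquet's actual implementation).
rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 20.0},
    {"id": 3, "name": "c", "amount": 30.0},
]

# Row layout: summing one column still walks every full record.
values_touched_row = sum(len(r) for r in rows)  # every field of every row
total_row = sum(r["amount"] for r in rows)

# Columnar layout: each column is stored contiguously.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# Only the "amount" column is read; "id" and "name" are never touched.
amount = columns["amount"]
values_touched_col = len(amount)
total_col = sum(amount)

print(total_row, total_col)                    # same result: 60.0 60.0
print(values_touched_row, values_touched_col)  # 9 values read vs. 3
```

With three columns the saving is a factor of three; on wide tables where a query needs a handful of columns, the gap grows accordingly.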

intermediate

How does compression in data formats affect Spark performance?

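Compression reduces file size, which lowers disk and network usage. Decompression costs CPU on every read, so the best choice depends on whether a job is I/O-bound or CPU-bound: a fast codec such as Snappy favors speed, while gzip favors smaller files.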
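
The trade-off can be observed with Python's standard `zlib` module. This is a minimal sketch on made-up data (`raw` is a hypothetical, highly repetitive payload), and `zlib` stands in for whichever codec a real Spark job would use.

```python
import zlib

# Hypothetical repetitive data, similar to a low-cardinality column.
raw = ("status=OK;" * 10_000).encode("utf-8")

fast = zlib.compress(raw, level=1)   # less CPU, usually larger output
small = zlib.compress(raw, level=9)  # more CPU, usually smaller output

print(len(raw), len(fast), len(small))

# Decompression costs CPU on every read but restores the exact bytes.
assert zlib.decompress(fast) == raw
assert zlib.decompress(small) == raw
```

Timing both levels on representative data is the practical way to find the balance for a given workload.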

intermediate

What is predicate pushdown and how does it relate to data formats?

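Predicate pushdown passes filter conditions down to the data source, so non-matching data is skipped while reading instead of being filtered after it is loaded. Parquet and ORC store min/max statistics for each column within blocks of rows, which lets the reader skip whole blocks that cannot match the filter.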
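
Statistics-based skipping can be sketched as follows. The `row_groups` structure and the `scan_greater_than` function are hypothetical names for illustration; real Parquet and ORC readers are considerably more involved.

```python
# Toy model of statistics-based skipping (hypothetical structures, not Parquet's real layout).
row_groups = [
    {"min": 1,   "max": 100, "values": list(range(1, 101))},
    {"min": 101, "max": 200, "values": list(range(101, 201))},
    {"min": 201, "max": 300, "values": list(range(201, 301))},
]

def scan_greater_than(groups, threshold):
    """Return values > threshold, skipping groups whose max rules them out."""
    matches, groups_read = [], 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # no value in this group can match, so it is never read
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read

matches, groups_read = scan_greater_than(row_groups, 250)
print(len(matches), groups_read)  # 50 matching values, 1 of 3 groups read
```

Skipping works best when the data is sorted or clustered on the filtered column, because the min/max ranges of different blocks then overlap less.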

beginner

Why might JSON or CSV formats be slower than Parquet in Spark?

JSON and CSV are row-oriented text formats. Spark must parse every record as text, convert values to their types, and read every field even when a query needs only a few columns, which makes them slower than Parquet.
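
The parsing cost shows up even in a small sketch using Python's standard `csv` module. The `text` content is hypothetical, and the `fields_parsed` counter is only there to show that every field is tokenized to extract a single column.

```python
import csv
import io

# Hypothetical CSV content; a real file would be read from disk.
text = "id,name,amount\n1,a,10.5\n2,b,20.0\n3,c,30.25\n"

fields_parsed = 0
amounts = []
for record in csv.DictReader(io.StringIO(text)):
    fields_parsed += len(record)             # every field in the row is tokenized
    amounts.append(float(record["amount"]))  # text must be converted to a number

print(sum(amounts), fields_parsed)  # 60.75 from 9 parsed fields
```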

Which data format is typically fastest for analytical queries in Spark?

A. JSON

B. CSV

C. Parquet

D. TXT

What feature allows Spark to read only necessary columns from a dataset?

A. Predicate pushdown

B. Column pruning

C. Compression

D. Partitioning

How does compression affect Spark's data processing?

A. Reduces file size but may add CPU overhead

B. Always slows down processing

C. Increases file size

D. Has no effect

Which data format supports predicate pushdown in Spark?

A. Parquet

B. CSV

C. JSON

D. TXT

Why is reading JSON slower than Parquet in Spark?

A. JSON is a binary format

B. JSON files are always compressed

C. JSON supports column pruning

D. JSON requires parsing text and reads all data

Explain how data format choice affects Spark query performance.

Describe why Parquet is often preferred over CSV or JSON for big data analytics in Spark.