In Apache Spark, you have two datasets: one stored as CSV files and another as Parquet files. Both contain the same data. Why does reading Parquet files usually result in faster query performance?
Think about how data is stored and how Spark reads only parts of the data it needs.
Parquet is a columnar storage format that compresses data and stores metadata, enabling Spark to skip reading unnecessary columns. CSV is row-based and uncompressed, so Spark reads all data even if only some columns are needed.
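The layout difference can be sketched in plain Python (this is an illustration of row-wise vs. column-wise storage, not Spark itself; all names and values are hypothetical):

```python
# Row-oriented layout (like CSV): each record is stored together, so
# fetching one field still means scanning through whole records on disk.
rows = [
    {"id": 1, "name": "a", "age": 34},
    {"id": 2, "name": "b", "age": 28},
    {"id": 3, "name": "c", "age": 41},
]
ages_from_rows = [row["age"] for row in rows]

# Column-oriented layout (like Parquet): each column is stored
# contiguously, so a reader can load only the 'age' values and skip
# the other columns entirely.
columns = {
    "id":   [1, 2, 3],
    "name": ["a", "b", "c"],
    "age":  [34, 28, 41],
}
ages_from_columns = columns["age"]

print(ages_from_rows)     # [34, 28, 41]
print(ages_from_columns)  # [34, 28, 41]
```

Both layouts yield the same answer; the columnar one simply lets a query touching one column read far less data, which is the pruning Spark exploits with Parquet.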
You save the same DataFrame in Spark as CSV and Parquet formats. Which option correctly shows the typical size difference between the two files?
df.write.csv('data_csv')
df.write.parquet('data_parquet')
# After saving, check the sizes of the 'data_csv' and 'data_parquet' folders
Consider how compression affects file size.
Parquet files use compression and efficient encoding, so they usually take less disk space than CSV files, which are plain text and uncompressed.
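The effect of compressing repetitive tabular text can be shown with the standard library alone (a sketch: Parquet actually uses dictionary and run-length encodings plus a codec such as Snappy, but gzip on CSV text illustrates why the size gap is so large; the data is made up):

```python
import csv
import gzip
import io

# Build a small, highly repetitive CSV in memory (hypothetical data).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "country", "status"])
for i in range(10_000):
    writer.writerow([i, "Germany", "active"])
raw = buf.getvalue().encode("utf-8")

# Compare plain-text size to the compressed size. Columns full of
# repeated values compress dramatically, which is the same property
# Parquet's encodings exploit.
compressed = gzip.compress(raw)
print(len(raw), len(compressed))
```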
Consider this Spark code snippet:
df_json = spark.read.json('data.json')
df_parquet = spark.read.parquet('data.parquet')
print(df_json.count())
print(df_parquet.count())

Assuming both files contain the same data, what will be the output?
Think about how Spark reads data formats and counts rows.
Both JSON and Parquet readers load the data and count rows. If the data is the same and valid, counts will match. Performance differs but output count is the same.
You run the same query on datasets stored as CSV, JSON, and Parquet. You want to visualize query execution time for each format. Which plot type best shows this comparison clearly?
Think about comparing categories with numeric values.
A bar chart clearly compares execution times across different data formats side by side. Pie charts and scatter plots are less clear for this purpose.
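A minimal matplotlib sketch of such a comparison (the timings below are invented for illustration; substitute your measured values):

```python
import os

import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

formats = ["CSV", "JSON", "Parquet"]
exec_time_s = [12.4, 15.1, 3.2]  # hypothetical query times in seconds

# One bar per format puts the three numbers side by side on a shared
# axis, which is exactly the comparison the question asks for.
plt.bar(formats, exec_time_s)
plt.ylabel("Query execution time (s)")
plt.title("Same query, three storage formats")
plt.savefig("format_timing.png")

saved = os.path.exists("format_timing.png")
print(saved)
```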
You have this Spark code:
df = spark.read.csv('data.csv', header=True)
df_filtered = df.filter(df['age'] > 30)
df_filtered.show()

The same filter on a Parquet file runs much faster. What is the main reason for this difference?
Consider how Spark optimizes queries differently for formats.
Parquet stores metadata and per-column statistics (such as min/max values) that enable predicate pushdown, so Spark reads only the row groups that can match the filter. CSV has no such metadata, so Spark must scan the entire file, making it slower.
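The mechanism behind that answer can be sketched in plain Python (this mimics Parquet's per-row-group min/max statistics; the structure and data are hypothetical, not Spark APIs):

```python
# Each "row group" carries min/max statistics for the 'age' column,
# the way Parquet files do for each column chunk.
row_groups = [
    {"min_age": 18, "max_age": 25, "ages": [18, 21, 25]},
    {"min_age": 26, "max_age": 29, "ages": [26, 28, 29]},
    {"min_age": 31, "max_age": 60, "ages": [31, 45, 60]},
]

def ages_over(groups, threshold):
    """Return matching ages and how many groups were actually read."""
    matches, groups_read = [], 0
    for rg in groups:
        if rg["max_age"] <= threshold:  # statistics prove no row matches,
            continue                    # so skip the group without reading it
        groups_read += 1
        matches.extend(a for a in rg["ages"] if a > threshold)
    return matches, groups_read

matches, groups_read = ages_over(row_groups, 30)
print(matches, groups_read)  # [31, 45, 60] 1
```

Only one of the three groups is read. In real Spark you can observe this: running `df_filtered.explain()` on a Parquet-backed DataFrame shows the filter listed under `PushedFilters` in the file scan node, while a CSV scan must evaluate the filter after reading every row.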