Parquet vs CSV vs JSON in Spark: Key Differences and Usage
Parquet is a fast, compressed columnar format ideal for big data processing, while CSV is a simple, plain-text, row-based format best for small or simple datasets. JSON supports nested data but is slower and larger than Parquet, making it suitable for semi-structured data but less efficient for large-scale analytics.
Quick Comparison
Here is a quick comparison of Parquet, CSV, and JSON formats in Spark based on key factors.
| Factor | Parquet | CSV | JSON |
|---|---|---|---|
| Data Type Support | Supports complex nested types and schema | Plain text, no schema enforcement | Supports nested and semi-structured data |
| Storage Format | Columnar, compressed | Row-based, plain text | Row-based, plain text with structure |
| Performance | Fast reads and writes, optimized for analytics | Slower, no compression by default | Slower due to parsing and size |
| File Size | Smaller due to compression | Larger, no compression | Larger, verbose format |
| Schema Enforcement | Schema stored with data | No schema, user must define | No strict schema, inferred at runtime |
| Use Case | Big data analytics, data lakes | Simple data exchange, small files | Semi-structured data, logs, APIs |
Key Differences
Parquet is a columnar storage format designed for efficient data compression and encoding schemes. It stores metadata and schema with the data, which helps Spark optimize query execution by reading only needed columns. This makes it very fast and space-efficient for large datasets.
CSV is a simple text format where each row is a line and columns are separated by commas. It does not store schema or data types, so Spark must infer schema or rely on user input. CSV files are easy to read and write but are inefficient for large-scale processing due to lack of compression and slower parsing.
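The schema gap is easy to demonstrate with plain Python's built-in csv module, no Spark required: every field comes back as a string, so any typing is left to the consumer.

```python
import csv
import io

# A small CSV payload with an integer-looking column
buf = io.StringIO("id,name,age\n1,Alice,29\n2,Bob,31\n")
rows = list(csv.DictReader(buf))

# CSV carries no types: 'age' is a string until someone casts it
assert rows[0]["age"] == "29"
assert int(rows[0]["age"]) == 29
```

This is exactly why Spark must either scan the data to infer types (`inferSchema`) or be handed an explicit schema by the user.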
JSON supports nested and complex data structures, making it useful for semi-structured data. However, JSON files are verbose and require more CPU to parse, which slows down processing in Spark. JSON also lacks built-in schema enforcement, so schema inference can be costly and error-prone.
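As a minimal illustration of the nesting JSON supports (plain Python, no Spark needed), a record can carry objects and arrays that CSV's flat rows cannot express directly:

```python
import json

# A record with a nested object and an array — flat CSV cannot express this directly
record = {"id": 1, "user": {"name": "Alice", "tags": ["admin", "beta"]}}

# Serialize and parse it back; structure survives the round trip
text = json.dumps(record)
parsed = json.loads(text)
assert parsed["user"]["name"] == "Alice"
assert parsed["user"]["tags"] == ["admin", "beta"]
```

The flip side is visible in `text`: field names are repeated in every record, which is part of why JSON files are larger and slower to parse than Parquet at scale.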
Code Comparison
Below is an example of reading and writing a DataFrame in Spark using the Parquet format.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetExample").getOrCreate()

# Create sample data
data = [(1, "Alice", 29), (2, "Bob", 31), (3, "Cathy", 25)]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)

# Write DataFrame as Parquet
parquet_path = "/tmp/sample_parquet"
df.write.mode("overwrite").parquet(parquet_path)

# Read Parquet file
df_parquet = spark.read.parquet(parquet_path)
df_parquet.show()
```
CSV Equivalent
Here is the equivalent code for reading and writing the same DataFrame using the CSV format in Spark.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSVExample").getOrCreate()

# Create sample data
data = [(1, "Alice", 29), (2, "Bob", 31), (3, "Cathy", 25)]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)

# Write DataFrame as CSV
csv_path = "/tmp/sample_csv"
df.write.mode("overwrite").option("header", "true").csv(csv_path)

# Read CSV file
# Schema must be inferred or specified
df_csv = spark.read.option("header", "true").option("inferSchema", "true").csv(csv_path)
df_csv.show()
```
When to Use Which
Choose Parquet when working with large datasets that require fast analytics and efficient storage, especially in data lakes or big data pipelines.
Choose CSV for simple, small datasets or when you need easy human readability and compatibility with many tools, but expect slower performance and larger files.
Choose JSON when dealing with semi-structured or nested data, such as logs or API data, but be aware of slower processing and larger file sizes compared to Parquet.