Apache Spark · data · ~30 mins

Parquet format and columnar storage in Apache Spark - Mini Project: Build & Apply

Explore Parquet Format and Columnar Storage with Apache Spark
📖 Scenario: You work as a data analyst at a retail company. You have sales data in a simple list format. Your manager wants you to save this data efficiently using Parquet format, which stores data in columns. This helps save space and speeds up queries.
🎯 Goal: Learn how to create a Spark DataFrame from a list, save it as a Parquet file, read it back, and understand the benefits of columnar storage.
📋 What You'll Learn
Create a Spark DataFrame from a list of sales records
Define a configuration variable for the Parquet file path
Save the DataFrame as a Parquet file
Read the Parquet file back into a DataFrame
Print the loaded DataFrame to see the data
💡 Why This Matters
🌍 Real World
Parquet is widely used in big data systems to store large datasets efficiently. It helps companies save storage and speed up data analysis.
💼 Career
Data engineers and data scientists use Parquet format to handle large datasets in distributed systems like Apache Spark.
1
Create a Spark DataFrame with sales data
Create a list called sales_data with these exact entries: ("2024-01-01", "StoreA", 100), ("2024-01-02", "StoreB", 150), ("2024-01-03", "StoreA", 200). Then create a Spark DataFrame called df from sales_data with columns "date", "store", and "sales".
Need a hint?

Use spark.createDataFrame() with your list and specify the column names as a list.

2
Set the Parquet file path
Create a string variable called parquet_path and set it to "sales_data.parquet".
Need a hint?

Just assign the string "sales_data.parquet" to the variable parquet_path.
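This step is a plain Python assignment; keeping the path in one variable means the write and read steps stay in sync:

```python
# Single configuration variable for the output location.
# Note: Spark writes Parquet as a directory of part files, not a single file.
parquet_path = "sales_data.parquet"
```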

3
Save the DataFrame as a Parquet file
Use the DataFrame df to save the data as a Parquet file at the path stored in parquet_path. Use the .write.parquet() method.
Need a hint?

Use df.write.parquet(parquet_path) to save the DataFrame in Parquet format.

4
Read the Parquet file and display the data
Read the Parquet file from parquet_path into a new DataFrame called df_loaded. Then call df_loaded.show() to display the data.
Need a hint?

Use spark.read.parquet(parquet_path) to read the file, then call df_loaded.show(). Note that show() prints the table itself and returns None, so wrapping it in print() only adds a stray "None" line.