Apache Spark · ~30 mins


Why Data Format Affects Performance in Apache Spark
📖 Scenario: You work as a data analyst at a company that collects sales data daily. The data is saved in different file formats. Your manager wants to understand how the file format affects the speed of reading and processing data in Apache Spark.
🎯 Goal: You will create a Spark DataFrame from a small sales dataset saved in CSV format, then configure a variable to select a file format, read the data in that format, and finally measure and print the time taken to read the data. This will help you see how data format affects performance.
📋 What You'll Learn
Create a Spark session
Create a small sales dataset with exact values
Configure a variable to select the data format
Read the data using the selected format
Measure and print the time taken to read the data
💡 Why This Matters
🌍 Real World
Data engineers and analysts often choose data formats to optimize speed and storage when working with big data tools like Apache Spark.
💼 Career
Understanding how data format affects performance helps in designing efficient data pipelines and improves job performance in data engineering roles.
Step 1: Create the sales dataset as a list of dictionaries
Create a variable called sales_data as a list of dictionaries with these exact entries: {'date': '2024-01-01', 'product': 'A', 'amount': 100}, {'date': '2024-01-02', 'product': 'B', 'amount': 150}, and {'date': '2024-01-03', 'product': 'C', 'amount': 200}.
Hint: Use a list with three dictionaries exactly as shown.
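Following this hint, the dataset might look like the sketch below, using the exact values from the step description:

```python
# Sales dataset as a list of dictionaries, one entry per day of sales.
sales_data = [
    {'date': '2024-01-01', 'product': 'A', 'amount': 100},
    {'date': '2024-01-02', 'product': 'B', 'amount': 150},
    {'date': '2024-01-03', 'product': 'C', 'amount': 200},
]
```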

Step 2: Create a variable to select the data format
Create a variable called data_format and set it to the string 'csv'.
Hint: Set the variable data_format exactly to the string 'csv'.
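This step is a single assignment. Keeping the format in one variable is what makes the later comparison easy; Spark's built-in file sources also include 'json', 'parquet', and 'orc', so swapping this one string changes the whole experiment:

```python
# Format selector for the experiment. The exercise uses 'csv';
# other built-in Spark sources ('json', 'parquet', 'orc') would work here too.
data_format = 'csv'
```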

Step 3: Read the data using the selected format and measure the time
Create a Spark session called spark. Use spark.createDataFrame(sales_data) to build an in-memory DataFrame, write it to a temporary location in the format given by data_format, then read it back into a DataFrame called df with spark.read.format(data_format).load(). Use the time module to measure how long the read takes, and store the result in a variable called read_time.
Hint: Use spark.createDataFrame to create the DataFrame, write it to a file in the chosen format, then read it back with spark.read.format(data_format).load(path). Use time.time() to measure the reading time.

Step 4: Print the time taken to read the data
Write a print statement to display the text "Time taken to read data in format: {data_format} is {read_time:.4f} seconds" using an f-string.
Hint: Use print(f"Time taken to read data in format: {data_format} is {read_time:.4f} seconds") to show the result.