
Schema definition and inference in Apache Spark - Mini Project: Build & Apply

Schema definition and inference
📖 Scenario: You work as a data analyst at a retail company. You receive sales data in CSV format. To analyze it, you need to load it into Spark with the correct schema. Sometimes the schema is provided; other times you let Spark infer it.
🎯 Goal: Learn how to define a schema manually and how to let Spark infer the schema automatically when loading CSV data.
📋 What You'll Learn
Create a Spark session
Load CSV data with manual schema definition
Load CSV data with schema inference
Display the loaded data
💡 Why This Matters
🌍 Real World
In real data projects, data often arrives as CSV files or from similar sources. Defining the correct schema up front avoids type errors and speeds up loading, since Spark can skip the extra pass over the data that inference requires.
💼 Career
Data engineers and data scientists frequently define or infer schemas when loading data into Spark for analysis or machine learning.
1
Create sample sales data as a list of tuples
Create a variable called sales_data as a list of tuples with these exact entries: ("2024-01-01", "Alice", 300), ("2024-01-02", "Bob", 150), ("2024-01-03", "Charlie", 200).
💡 Hint: Use a list of tuples exactly as shown.
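As a minimal sketch, the sample data from this step looks like:

```python
# Sample sales data: one tuple per sale, as (date, customer, amount).
sales_data = [
    ("2024-01-01", "Alice", 300),
    ("2024-01-02", "Bob", 150),
    ("2024-01-03", "Charlie", 200),
]
```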

2
Define a manual schema for the sales data
Create a variable called schema using StructType with three fields: date as StringType(), customer as StringType(), and amount as IntegerType(). Import the required types from pyspark.sql.types.
💡 Hint: Import StructType, StructField, StringType, and IntegerType from pyspark.sql.types and define the schema as shown.

3
Create a DataFrame using the manual schema
Create a Spark session called spark. Then create a DataFrame called df_manual from sales_data using spark.createDataFrame() with the schema you defined.
💡 Hint: Use SparkSession.builder.appName(...).getOrCreate() to create spark, then create df_manual with createDataFrame and the schema.

4
Load the same data from CSV with schema inference and show both DataFrames
Save sales_data as a CSV file called sales.csv with header. Then read it back into a DataFrame called df_infer using spark.read.csv() with header=True and inferSchema=True. Finally, print the schema and show the first rows of both df_manual and df_infer.
💡 Hint: Use Python's csv module to write sales_data to sales.csv with a header row. Then read it back with spark.read.csv using header=True and inferSchema=True, and use printSchema() and show() to display both DataFrames.