
Schema definition and inference in Apache Spark - Mini Project: Build & Apply

Schema definition and inference
📖 Scenario: You work as a data analyst at a retail company. You receive sales data in CSV format. To analyze it, you need to load it into Spark with the correct schema. Sometimes the schema is provided; other times you let Spark infer it.
🎯 Goal: Learn how to define a schema manually and how to let Spark infer the schema automatically when loading CSV data.
📋 What You'll Learn
Create a Spark session
Load CSV data with manual schema definition
Load CSV data with schema inference
Display the loaded data
💡 Why This Matters
🌍 Real World
In real data projects, data often arrives as CSV files or from similar sources. Defining the correct schema up front avoids type errors and speeds up loading, since Spark can skip the extra pass over the data that inference requires.
💼 Career
Data engineers and data scientists frequently define or infer schemas when loading data into Spark for analysis or machine learning.
1
Create sample sales data as a list of tuples
Create a variable called sales_data as a list of tuples with these exact entries: ("2024-01-01", "Alice", 300), ("2024-01-02", "Bob", 150), ("2024-01-03", "Charlie", 200).
💡 Hint: Use a list of tuples exactly as shown.
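As a minimal sketch, the sample data from this step looks like:

```python
# Sample sales data: one tuple per sale, as (date, customer, amount).
sales_data = [
    ("2024-01-01", "Alice", 300),
    ("2024-01-02", "Bob", 150),
    ("2024-01-03", "Charlie", 200),
]
```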

2
Define a manual schema for the sales data
Create a variable called schema using StructType with three fields: date as StringType(), customer as StringType(), and amount as IntegerType(). Import the required types from pyspark.sql.types.
💡 Hint: Import StructType, StructField, StringType, and IntegerType from pyspark.sql.types and define the schema as shown.

3
Create a DataFrame using the manual schema
Create a Spark session called spark. Then create a DataFrame called df_manual from sales_data using spark.createDataFrame() with the schema you defined.
💡 Hint: Use SparkSession.builder.appName(...).getOrCreate() to create spark, then create df_manual with createDataFrame and the schema.

4
Load the same data from CSV with schema inference and show both DataFrames
Save sales_data as a CSV file called sales.csv with header. Then read it back into a DataFrame called df_infer using spark.read.csv() with header=True and inferSchema=True. Finally, print the schema and show the first rows of both df_manual and df_infer.
💡 Hint: Use Python's csv module to write sales_data to sales.csv with a header row. Then read it back with spark.read.csv using header=True and inferSchema=True, and use printSchema() and show() to display both DataFrames.