
Schema validation in Apache Spark - Step-by-Step Execution

Concept Flow - Schema validation
1. Define Schema
2. Load Data
3. Apply Schema to Data
4. Validate Data Against Schema
   - If Valid: Process Data
   - If Invalid: Raise Error or Handle
5. End
The flow starts by defining a schema, then loading data, applying the schema, validating the data, and either processing valid data or handling errors for invalid data.
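The same flow can be sketched in plain Python, without Spark itself. In this sketch the schema is represented as an ordinary dict mapping column names to Python types, and `validate_row` is an illustrative helper, not part of Spark's API:

```python
# Plain-Python sketch of the flow above (illustration only, not Spark's API).
# The schema maps each column name to its expected Python type.
schema = {"name": str, "age": int}

def validate_row(row, schema):
    """Return True if every value matches the type declared in the schema."""
    return all(isinstance(value, expected)
               for value, expected in zip(row, schema.values()))

rows = [("Alice", 30), ("Bob", "not_a_number")]

for row in rows:
    if validate_row(row, schema):
        print(f"process {row}")   # valid branch: process data
    else:
        print(f"reject {row}")    # invalid branch: raise error or handle
```

Spark performs an analogous per-row type check internally when it builds a DataFrame against an explicit schema.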
Execution Sample
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)
])

data = [('Alice', 30), ('Bob', 'not_a_number'), ('Charlie', 25)]

# Create DataFrame with schema; Spark verifies each row against the
# declared types (verifySchema defaults to True)
try:
    df = spark.createDataFrame(data, schema=schema)
    df.show()
except Exception as e:
    print(f"Schema validation error: {e}")
This code defines a schema with 'name' as string and 'age' as integer, then tries to create a DataFrame from data in which one age value is a string. Spark's type check rejects the invalid row, so the except block prints the schema validation error instead of showing the DataFrame.
Execution Table

| Step | Action | Data Sample | Schema Check | Result |
| --- | --- | --- | --- | --- |
| 1 | Define schema with fields 'name' (string) and 'age' (integer) | N/A | N/A | Schema ready |
| 2 | Load data [('Alice', 30), ('Bob', 'not_a_number'), ('Charlie', 25)] | ('Alice', 30) | N/A | Data loaded |
| 3 | Create DataFrame applying schema | ('Alice', 30) | name: string, age: integer | Row accepted |
| 4 | Validate row ('Bob', 'not_a_number') | ('Bob', 'not_a_number') | age should be integer | Validation error raised |
| 5 | Catch exception and print error | N/A | N/A | Schema validation error printed |
| 6 | End process | N/A | N/A | Process stopped due to error |
💡 The data contains an invalid type for 'age' in the second row, which raises a schema validation error and stops the process.
Variable Tracker

| Variable | Start | After Step 2 | After Step 3 | After Step 4 | Final |
| --- | --- | --- | --- | --- | --- |
| schema | None | Defined StructType with name and age fields | Same | Same | Same |
| data | None | [('Alice', 30), ('Bob', 'not_a_number'), ('Charlie', 25)] | Same | Same | Same |
| df | None | None | Creation in progress; first row accepted | Error on second row; df not fully created | None due to error |
Key Moments - 2 Insights
Why does the DataFrame creation fail when 'age' is 'not_a_number'?
Because the schema expects 'age' to be an integer, but the value 'not_a_number' is a string, causing a type mismatch error, as shown at step 4 of the execution table.
Can Spark partially create a DataFrame if some rows fail schema validation?
No, Spark raises an error and stops creating the DataFrame when it encounters invalid data, as seen in step 5 where the exception is caught.
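One common workaround is to filter out mismatched rows before handing the data to Spark, so that DataFrame creation succeeds on the clean subset. A minimal plain-Python sketch of that pre-filtering step (the `row_matches` helper and `expected_types` tuple are illustrative, not part of Spark's API):

```python
# Pre-filter rows so only schema-conforming data reaches createDataFrame.
# expected_types is an illustrative tuple of Python types, one per column.
expected_types = (str, int)

def row_matches(row, types=expected_types):
    """Return True if the row has the right arity and all values match."""
    return (len(row) == len(types)
            and all(isinstance(v, t) for v, t in zip(row, types)))

data = [("Alice", 30), ("Bob", "not_a_number"), ("Charlie", 25)]
valid, invalid = [], []
for row in data:
    (valid if row_matches(row) else invalid).append(row)

print(valid)    # rows safe to pass to spark.createDataFrame(valid, schema=schema)
print(invalid)  # rows to log or route to an error handler
```

With this split, the valid rows can be loaded into a DataFrame while the invalid ones are handled separately, instead of the whole creation failing.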
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table: what happens at step 4 when validating the row ('Bob', 'not_a_number')?
A. Validation error is raised due to type mismatch
B. Row is accepted without error
C. Row is skipped silently
D. Schema is changed to accept string
💡 Hint
Refer to the execution table, row 4, under the 'Result' column.
At which step is the schema first defined?
A. Step 3
B. Step 1
C. Step 2
D. Step 4
💡 Hint
Check the execution table, row 1, under 'Action'.
If the invalid age value 'not_a_number' were changed to 40, what would happen at step 4?
A. DataFrame creation would fail at step 3
B. Validation error still occurs
C. Row would be accepted successfully
D. Schema would need to be redefined
💡 Hint
Look at the variable tracker for 'df' changes and the execution table, step 4 result.
Concept Snapshot
Schema validation in Spark:
- Define a StructType schema with field names and types
- Load data as a list or RDD
- Create DataFrame applying the schema
- Spark checks each row matches schema types
- If mismatch, Spark raises error and stops
- Valid data proceeds for processing
Full Transcript
Schema validation in Apache Spark involves defining a schema that specifies the expected data types for each column. When loading data, Spark applies this schema to ensure each row matches the expected types. If a row contains a value that does not match the schema, such as a string where an integer is expected, Spark raises a validation error and stops creating the DataFrame. This prevents invalid data from entering the processing pipeline. The example code shows defining a schema with 'name' as string and 'age' as integer, then attempting to create a DataFrame with one invalid age value. The execution trace shows the error occurs at the invalid row, and the process stops with an error message.