
Schema validation in Apache Spark - Step-by-Step Execution

Concept Flow - Schema validation
1. Define Schema
2. Load Data
3. Apply Schema to Data
4. Validate Data Against Schema
   - If Valid: Process Data
   - If Invalid: Raise Error or Handle
5. End
The flow starts by defining a schema, then loading data, applying the schema, validating the data, and either processing valid data or handling errors for invalid data.
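The same flow can be sketched in plain Python, without Spark itself. In this sketch the schema is represented as an ordinary dict mapping column names to Python types, and `validate_row` is an illustrative helper, not part of Spark's API:

```python
# Plain-Python sketch of the flow above (illustration only, not Spark's API).
# The schema maps each column name to its expected Python type.
schema = {"name": str, "age": int}

def validate_row(row, schema):
    """Return True if every value matches the type declared in the schema."""
    return all(isinstance(value, expected)
               for value, expected in zip(row, schema.values()))

rows = [("Alice", 30), ("Bob", "not_a_number")]

for row in rows:
    if validate_row(row, schema):
        print(f"process {row}")   # valid branch: process data
    else:
        print(f"reject {row}")    # invalid branch: raise error or handle
```

Spark performs an analogous per-row type check internally when it builds a DataFrame against an explicit schema.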
Execution Sample
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)
])

data = [('Alice', 30), ('Bob', 'not_a_number'), ('Charlie', 25)]

# Create DataFrame with schema; Spark verifies each row against the
# declared types (verifySchema defaults to True)
try:
    df = spark.createDataFrame(data, schema=schema)
    df.show()
except Exception as e:
    print(f"Schema validation error: {e}")
This code defines a schema with 'name' as string and 'age' as integer, then tries to create a DataFrame from data in which one age value is a string. Spark's type check rejects the invalid row, so the except block prints the schema validation error instead of showing the DataFrame.
Execution Table

| Step | Action | Data Sample | Schema Check | Result |
| --- | --- | --- | --- | --- |
| 1 | Define schema with fields 'name' (string) and 'age' (integer) | N/A | N/A | Schema ready |
| 2 | Load data [('Alice', 30), ('Bob', 'not_a_number'), ('Charlie', 25)] | ('Alice', 30) | N/A | Data loaded |
| 3 | Create DataFrame applying schema | ('Alice', 30) | name: string, age: integer | Row accepted |
| 4 | Validate row ('Bob', 'not_a_number') | ('Bob', 'not_a_number') | age should be integer | Validation error raised |
| 5 | Catch exception and print error | N/A | N/A | Schema validation error printed |
| 6 | End process | N/A | N/A | Process stopped due to error |
💡 The data contains an invalid type for 'age' in the second row, which raises a schema validation error and stops the process.
Variable Tracker

| Variable | Start | After Step 2 | After Step 3 | After Step 4 | Final |
| --- | --- | --- | --- | --- | --- |
| schema | None | Defined StructType with name and age fields | Same | Same | Same |
| data | None | [('Alice', 30), ('Bob', 'not_a_number'), ('Charlie', 25)] | Same | Same | Same |
| df | None | None | Creation in progress; first row accepted | Error on second row; df not fully created | None due to error |
Key Moments - 2 Insights
Why does the DataFrame creation fail when 'age' is 'not_a_number'?
Because the schema expects 'age' to be an integer, but the value 'not_a_number' is a string, causing a type mismatch error, as shown at step 4 of the execution table.
Can Spark partially create a DataFrame if some rows fail schema validation?
No, Spark raises an error and stops creating the DataFrame when it encounters invalid data, as seen in step 5 where the exception is caught.
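One common workaround is to filter out mismatched rows before handing the data to Spark, so that DataFrame creation succeeds on the clean subset. A minimal plain-Python sketch of that pre-filtering step (the `row_matches` helper and `expected_types` tuple are illustrative, not part of Spark's API):

```python
# Pre-filter rows so only schema-conforming data reaches createDataFrame.
# expected_types is an illustrative tuple of Python types, one per column.
expected_types = (str, int)

def row_matches(row, types=expected_types):
    """Return True if the row has the right arity and all values match."""
    return (len(row) == len(types)
            and all(isinstance(v, t) for v, t in zip(row, types)))

data = [("Alice", 30), ("Bob", "not_a_number"), ("Charlie", 25)]
valid, invalid = [], []
for row in data:
    (valid if row_matches(row) else invalid).append(row)

print(valid)    # rows safe to pass to spark.createDataFrame(valid, schema=schema)
print(invalid)  # rows to log or route to an error handler
```

With this split, the valid rows can be loaded into a DataFrame while the invalid ones are handled separately, instead of the whole creation failing.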
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table: what happens at step 4 when validating the row ('Bob', 'not_a_number')?
A. Validation error is raised due to type mismatch
B. Row is accepted without error
C. Row is skipped silently
D. Schema is changed to accept string
💡 Hint
Refer to the execution table, row 4, under the 'Result' column.
At which step is the schema first defined?
A. Step 3
B. Step 1
C. Step 2
D. Step 4
💡 Hint
Check the execution table, row 1, under 'Action'.
If the invalid age value 'not_a_number' were changed to 40, what would happen at step 4?
A. DataFrame creation would fail at step 3
B. Validation error still occurs
C. Row would be accepted successfully
D. Schema would need to be redefined
💡 Hint
Look at the variable tracker for 'df' changes and the execution table, step 4 result.
Concept Snapshot
Schema validation in Spark:
- Define a StructType schema with field names and types
- Load data as a list or RDD
- Create DataFrame applying the schema
- Spark checks each row matches schema types
- If mismatch, Spark raises error and stops
- Valid data proceeds for processing
Full Transcript
Schema validation in Apache Spark involves defining a schema that specifies the expected data types for each column. When loading data, Spark applies this schema to ensure each row matches the expected types. If a row contains a value that does not match the schema, such as a string where an integer is expected, Spark raises a validation error and stops creating the DataFrame. This prevents invalid data from entering the processing pipeline. The example code shows defining a schema with 'name' as string and 'age' as integer, then attempting to create a DataFrame with one invalid age value. The execution trace shows the error occurs at the invalid row, and the process stops with an error message.