Apache Spark · ~30 mins

Schema validation in Apache Spark - Mini Project: Build & Apply

📖 Scenario: You work at a company that collects customer data. You want to make sure the data is clean and follows the correct format before analysis.
🎯 Goal: Create a Spark DataFrame with customer data, define a schema that specifies the expected data types, and verify that the data matches the schema.
📋 What You'll Learn
Create a Spark DataFrame with given customer data
Define a schema with correct data types for each column
Apply the schema to the DataFrame to validate data
Show the validated DataFrame
💡 Why This Matters
🌍 Real World
Data scientists often receive raw data that may have inconsistent types. Schema validation helps ensure data quality before analysis.
💼 Career
Knowing how to define and apply schemas in Spark is essential for data engineers and data scientists working with big data pipelines.
1
Create customer data as a list of tuples
Create a variable called customer_data as a list of tuples with these exact entries: (1, 'Alice', 29), (2, 'Bob', 35), (3, 'Charlie', 40).
Need a hint?

Use square brackets to create a list and parentheses for each tuple.
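Following the hint, this step is plain Python: a list literal holding one tuple per customer.

```python
# Each tuple is one customer record: (id, name, age).
customer_data = [(1, 'Alice', 29), (2, 'Bob', 35), (3, 'Charlie', 40)]
```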

2
Define the schema for the customer data
Create a variable called schema using StructType with three fields: id as IntegerType(), name as StringType(), and age as IntegerType(). Import the necessary types from pyspark.sql.types.
Need a hint?

Use StructType and StructField to define the schema fields.

3
Create a Spark DataFrame with the schema
Create a Spark DataFrame called df from customer_data using spark.createDataFrame() and apply the schema to it.
Need a hint?

Use spark.createDataFrame() with the schema argument.

4
Show the validated DataFrame
Use printSchema() on df to display the schema, then use df.show() to display the data.
Need a hint?

Use df.printSchema() to see the schema and df.show() to see the data.