
How to Print Schema in PySpark: Simple Guide

In PySpark, you can print the schema of a DataFrame using the printSchema() method. This method shows the column names, data types, and nullability in a tree format.

Syntax

The syntax to print the schema of a PySpark DataFrame is simple:

  • dataframe.printSchema(): This prints the schema of the DataFrame in a readable tree format.
python
dataframe.printSchema()

Example

This example creates a simple DataFrame and prints its schema to show the column names, types, and nullability.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SchemaExample').getOrCreate()
data = [(1, 'Alice', 29), (2, 'Bob', 31)]
columns = ['id', 'name', 'age']
df = spark.createDataFrame(data, schema=columns)
df.printSchema()
Output
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)

Common Pitfalls

Some common mistakes when printing schema in PySpark include:

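  • Calling printSchema() before the DataFrame has been created or loaded. This raises a NameError if the variable is undefined, or an AttributeError if it is None.
  • Confusing printSchema() with the schema property. printSchema() prints the tree and returns None; the schema property returns the schema object (a StructType) without printing anything.
  • Not initializing a SparkSession before creating the DataFrame.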
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PitfallExample').getOrCreate()

# Wrong: calling printSchema() on None, or before the DataFrame exists
# df = None
# df.printSchema()  # AttributeError: 'NoneType' object has no attribute 'printSchema'

# Right way:
data = [(1, 'Alice')]
columns = ['id', 'name']
df = spark.createDataFrame(data, schema=columns)
df.printSchema()
Output
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

Quick Reference

Remember these tips when printing schema in PySpark:

  • Use printSchema() to display the schema in a readable tree format.
  • The schema shows column names, data types, and whether null values are allowed.
  • Always create or load your DataFrame before calling printSchema().

Key Takeaways

Use the printSchema() method to print a DataFrame's schema in PySpark.
Schema output shows columns, data types, and nullability in a tree format.
Always create or load your DataFrame before calling printSchema().
Do not confuse printSchema() with the schema property, which returns the schema object instead of printing it.
Initialize a SparkSession before working with DataFrames.