How to Print Schema in PySpark: Simple Guide
In PySpark, you can print the schema of a DataFrame using the printSchema() method. This method shows the column names, data types, and nullability in a tree format.
Syntax
The syntax to print the schema of a PySpark DataFrame is simple:
dataframe.printSchema(): This prints the schema of the DataFrame in a readable tree format.
python
dataframe.printSchema()
Example
This example creates a simple DataFrame and prints its schema to show the column names, types, and nullability.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SchemaExample').getOrCreate()

data = [(1, 'Alice', 29), (2, 'Bob', 31)]
columns = ['id', 'name', 'age']

df = spark.createDataFrame(data, schema=columns)
df.printSchema()
Output
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- age: long (nullable = true)
Common Pitfalls
Some common mistakes when printing schema in PySpark include:
- Calling printSchema() before the DataFrame has been created or loaded. This raises a NameError if the variable does not exist yet, or an AttributeError if it is None.
- Confusing printSchema() with the schema property; the latter returns the schema object but does not print it.
- Not initializing a SparkSession before creating the DataFrame.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PitfallExample').getOrCreate()

# Wrong: calling printSchema on None or before DataFrame creation
# df = None
# df.printSchema()  # This will cause an error

# Right way:
data = [(1, 'Alice')]
columns = ['id', 'name']
df = spark.createDataFrame(data, schema=columns)
df.printSchema()
Output
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
Quick Reference
Remember these tips when printing schema in PySpark:
- Use printSchema() to display the schema in a readable format.
- The schema shows column names, data types, and whether null values are allowed.
- Always create or load your DataFrame before calling printSchema().
Key Takeaways
- Use the printSchema() method to print a DataFrame's schema in PySpark.
- The schema output shows columns, data types, and nullability in a tree format.
- Always create or load your DataFrame before calling printSchema().
- Do not confuse printSchema() with the schema property.
- Initialize a SparkSession before working with DataFrames.