
How to Create DataFrame in PySpark: Syntax and Examples

To create a DataFrame in PySpark, use the SparkSession.createDataFrame() method with your data and an optional schema; if no schema is given, PySpark infers one automatically. The data can be a list of tuples, a list of dictionaries, or a Pandas DataFrame.

Syntax

The basic syntax to create a DataFrame in PySpark is:

  • spark.createDataFrame(data, schema=None)

Where:

  • spark is your SparkSession object.
  • data is a list of tuples, list of dictionaries, or a Pandas DataFrame.
  • schema is optional and defines column names and types. If not provided, PySpark tries to infer it.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

data = [(1, 'Alice'), (2, 'Bob')]
schema = ['id', 'name']

df = spark.createDataFrame(data, schema)
```

Example

This example shows how to create a DataFrame from a list of tuples with a schema, then display its content.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

data = [(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')]
schema = ['id', 'name']

df = spark.createDataFrame(data, schema)
df.show()
```

Output

```
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+
```

Common Pitfalls

Common mistakes when creating DataFrames in PySpark include:

  • Not creating or importing a SparkSession before creating a DataFrame.
  • Passing data in an unsupported format (e.g., a plain list without tuples or dicts).
  • Not specifying a schema when data types are ambiguous, leading to incorrect type inference.
  • Confusing PySpark DataFrame with Pandas DataFrame; they have different methods.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

# Wrong: a list of lists with no schema still works, but the columns
# fall back to default names (_1, _2) and types may not be what you intended
# data = [[1, 'Alice'], [2, 'Bob']]
# df = spark.createDataFrame(data)

# Right: use a list of tuples and specify the schema
data = [(1, 'Alice'), (2, 'Bob')]
schema = ['id', 'name']
df = spark.createDataFrame(data, schema)
df.show()
```

Output

```
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+
```

Quick Reference

| Method | Description | Example |
|---|---|---|
| `spark.createDataFrame(data, schema=None)` | Create DataFrame from data with optional schema | `spark.createDataFrame([(1, 'A')], ['id', 'name'])` |
| `df.show()` | Display DataFrame content | `df.show()` |
| `spark.read.csv(path)` | Create DataFrame by reading a CSV file | `spark.read.csv('file.csv', header=True)` |
| `spark.createDataFrame(pandas_df)` | Create DataFrame from a Pandas DataFrame | `spark.createDataFrame(pandas_df)` |

Key Takeaways

  • Always create a SparkSession before creating a DataFrame.
  • Use spark.createDataFrame() with data and an optional schema to create DataFrames.
  • Provide a schema to avoid incorrect type inference when data types are unclear.
  • Data must be a list of tuples, dictionaries, or a Pandas DataFrame for spark.createDataFrame().
  • Use df.show() to quickly view the contents of your DataFrame.