
How to Create DataFrame in PySpark: Syntax and Examples

To create a DataFrame in PySpark, use the SparkSession.createDataFrame() method with your data and an optional schema; if no schema is given, PySpark infers one automatically. The data can be a list of tuples, a list of dictionaries, or a Pandas DataFrame.

Syntax

The basic syntax to create a DataFrame in PySpark is:

  • spark.createDataFrame(data, schema=None)

Where:

  • spark is your SparkSession object.
  • data is a list of tuples, list of dictionaries, or a Pandas DataFrame.
  • schema is optional and defines column names and types. If not provided, PySpark tries to infer it.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

data = [(1, 'Alice'), (2, 'Bob')]
schema = ['id', 'name']

df = spark.createDataFrame(data, schema)
```

Example

This example shows how to create a DataFrame from a list of tuples with a schema, then display its content.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

data = [(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')]
schema = ['id', 'name']

df = spark.createDataFrame(data, schema)
df.show()
```

Output

```
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+
```

Common Pitfalls

Common mistakes when creating DataFrames in PySpark include:

  • Not creating or importing a SparkSession before creating a DataFrame.
  • Passing data in an unsupported format (e.g., a plain list without tuples or dicts).
  • Not specifying a schema when data types are ambiguous, leading to incorrect type inference.
  • Confusing PySpark DataFrame with Pandas DataFrame; they have different methods.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

# Wrong: a list of lists with no schema still works, but the columns
# fall back to default names (_1, _2) and types may not be what you intended
# data = [[1, 'Alice'], [2, 'Bob']]
# df = spark.createDataFrame(data)

# Right: use a list of tuples and specify the schema
data = [(1, 'Alice'), (2, 'Bob')]
schema = ['id', 'name']
df = spark.createDataFrame(data, schema)
df.show()
```

Output

```
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+
```

Quick Reference

| Method | Description | Example |
|---|---|---|
| `spark.createDataFrame(data, schema=None)` | Create DataFrame from data with optional schema | `spark.createDataFrame([(1, 'A')], ['id', 'name'])` |
| `df.show()` | Display DataFrame content | `df.show()` |
| `spark.read.csv(path)` | Create DataFrame by reading a CSV file | `spark.read.csv('file.csv', header=True)` |
| `spark.createDataFrame(pandas_df)` | Create DataFrame from a Pandas DataFrame | `spark.createDataFrame(pandas_df)` |

Key Takeaways

  • Always create a SparkSession before creating a DataFrame.
  • Use spark.createDataFrame() with data and an optional schema to create DataFrames.
  • Provide a schema to avoid incorrect type inference when data types are unclear.
  • Data must be a list of tuples, dictionaries, or a Pandas DataFrame for spark.createDataFrame().
  • Use df.show() to quickly view the contents of your DataFrame.