How to Create DataFrame in PySpark: Syntax and Examples
To create a DataFrame in PySpark, call the SparkSession.createDataFrame() method with your data and, optionally, a schema. If you omit the schema, PySpark infers it from the data. The data can be a list of tuples, a list of dictionaries, or a Pandas DataFrame.
Syntax
The basic syntax to create a DataFrame in PySpark is:
spark.createDataFrame(data, schema=None)
Where:
- spark is your SparkSession object.
- data is a list of tuples, a list of dictionaries, or a Pandas DataFrame.
- schema is optional and defines column names and types. If not provided, PySpark tries to infer it.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

data = [(1, 'Alice'), (2, 'Bob')]
schema = ['id', 'name']
df = spark.createDataFrame(data, schema)
```
Example
This example shows how to create a DataFrame from a list of tuples with a schema, then display its content.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

data = [(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')]
schema = ['id', 'name']
df = spark.createDataFrame(data, schema)
df.show()
```
Output
```
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+
```
Common Pitfalls
Common mistakes when creating DataFrames in PySpark include:
- Not creating or importing a SparkSession before creating a DataFrame.
- Passing data in an unsupported format (e.g., a flat list of plain values such as [1, 2, 3] rather than a list of tuples or dicts).
- Not specifying a schema when data types are ambiguous, leading to incorrect type inference.
- Confusing PySpark DataFrame with Pandas DataFrame; they have different methods.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

# Risky: rows given without a schema
# data = [[1, 'Alice'], [2, 'Bob']]
# df = spark.createDataFrame(data)  # Works, but columns get default names _1 and _2

# Better: pass column names (or a full schema) explicitly
data = [(1, 'Alice'), (2, 'Bob')]
schema = ['id', 'name']
df = spark.createDataFrame(data, schema)
df.show()
```
Output
```
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+
```
Quick Reference
| Method | Description | Example |
|---|---|---|
| spark.createDataFrame(data, schema=None) | Create DataFrame from data with optional schema | spark.createDataFrame([(1, 'A')], ['id', 'name']) |
| df.show() | Display DataFrame content | df.show() |
| spark.read.csv(path) | Create DataFrame by reading CSV file | spark.read.csv('file.csv', header=True) |
| spark.createDataFrame(pandas_df) | Create DataFrame from Pandas DataFrame | spark.createDataFrame(pandas_df) |
Key Takeaways
- Always create a SparkSession before creating a DataFrame.
- Use spark.createDataFrame() with data and an optional schema to create DataFrames.
- Provide a schema to avoid incorrect type inference when data types are unclear.
- Pass data to spark.createDataFrame() as a list of tuples, a list of dictionaries, or a Pandas DataFrame.
- Use df.show() to quickly view the contents of your DataFrame.