How to Create DataFrame from List in PySpark Easily
To create a DataFrame from a list in PySpark, use
SparkSession.createDataFrame() by passing your list and, optionally, a schema. This converts the list into a distributed DataFrame ready for analysis.
Syntax
The basic syntax to create a DataFrame from a list in PySpark is:
spark.createDataFrame(data, schema=None)
Where:
- data is your list of rows (each row can be a tuple, list, or dict).
- schema is optional and defines column names and types.
python
df = spark.createDataFrame(data, schema=None)
Example
This example shows how to create a PySpark DataFrame from a list of tuples with column names.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('Example').getOrCreate()

# List of tuples
data = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')]

# Define column names
columns = ['id', 'name']

# Create DataFrame
df = spark.createDataFrame(data, schema=columns)
df.show()
Output
+---+-----+
| id| name|
+---+-----+
| 1|Alice|
| 2| Bob|
| 3|Cathy|
+---+-----+
Common Pitfalls
Common mistakes when creating DataFrames from lists include:
- Not providing a schema, which can lead to default column names like _1, _2.
- Passing a list of dictionaries without specifying a schema can cause unexpected column order.
- Using inconsistent data types across rows, which causes type-inference errors.
Example of a wrong and right way:
python
# Wrong: no schema, list of tuples
wrong_df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')])
wrong_df.show()

# Right: specify column names
right_df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], ['id', 'name'])
right_df.show()
Output
+---+-----+
| _1| _2|
+---+-----+
| 1|Alice|
| 2| Bob|
+---+-----+
+---+-----+
| id| name|
+---+-----+
| 1|Alice|
| 2| Bob|
+---+-----+
Quick Reference
| Step | Description | Example |
|---|---|---|
| 1 | Prepare your list of data | [(1, 'Alice'), (2, 'Bob')] |
| 2 | Optionally define column names | ['id', 'name'] |
| 3 | Use spark.createDataFrame() to create DataFrame | spark.createDataFrame(data, schema) |
| 4 | Use df.show() to display the DataFrame | df.show() |
Key Takeaways
- Use spark.createDataFrame() to convert a list into a PySpark DataFrame.
- Always specify a schema (column names) to avoid default column names.
- Ensure data types in the list are consistent for smooth DataFrame creation.
- Use df.show() to quickly view the DataFrame content.
- Lists of tuples are the most common input format for creating DataFrames.