
How to Create a DataFrame from a List in PySpark Easily

To create a DataFrame from a list in PySpark, use SparkSession.createDataFrame() by passing your list and optionally a schema. This converts the list into a distributed DataFrame ready for analysis.

Syntax

The basic syntax to create a DataFrame from a list in PySpark is:

  • spark.createDataFrame(data, schema=None)

Where:

  • data is your list of rows (each row can be a tuple, list, or dict).
  • schema is optional and defines column names and types; it can be a list of column names, a DDL-style string, or a StructType.
python
df = spark.createDataFrame(data, schema=None)

Example

This example shows how to create a PySpark DataFrame from a list of tuples with column names.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('Example').getOrCreate()

# List of tuples
data = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')]

# Define column names
columns = ['id', 'name']

# Create DataFrame
df = spark.createDataFrame(data, schema=columns)

df.show()
Output
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  3|Cathy|
+---+-----+

Common Pitfalls

Common mistakes when creating DataFrames from lists include:

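  • Omitting column names, which yields default names such as _1 and _2.
  • Passing a list of dictionaries without an explicit schema, which can produce an unexpected column order because the order is then decided by schema inference, not by the order of the keys you wrote.
  • Mixing data types within the same column across rows, which typically makes schema inference fail with an error.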

Example of the first pitfall, without and with column names:

python
# Problematic: no column names, so Spark falls back to _1, _2
wrong_df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')])
wrong_df.show()

# Right: specify column names
right_df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], ['id', 'name'])
right_df.show()
Output
+---+-----+
| _1|   _2|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+

Quick Reference

Step | Description                                     | Example
1    | Prepare your list of data                       | [(1, 'Alice'), (2, 'Bob')]
2    | Optionally define column names                  | ['id', 'name']
3    | Use spark.createDataFrame() to create DataFrame | spark.createDataFrame(data, schema)
4    | Use df.show() to display the DataFrame          | df.show()

Key Takeaways

  • Use spark.createDataFrame() to convert a list into a PySpark DataFrame.
  • Always specify a schema (column names) to avoid default column names.
  • Ensure data types in the list are consistent for smooth DataFrame creation.
  • Use df.show() to quickly view the DataFrame content.
  • Lists of tuples are the most common input format for creating DataFrames.