How to Use collect_list in PySpark: Syntax and Examples

In PySpark, collect_list is an aggregate function that gathers the values of a column into a list (an array column). Combined with groupBy, it produces one list per group. It keeps every value, including duplicates.

Syntax

The collect_list function is used as an aggregate function in PySpark. It collects all values from a specified column into a list for each group defined by groupBy.

groupBy(column): Groups the DataFrame by the specified column.
agg(collect_list(column)): Aggregates the grouped data by collecting values into a list.

python

from pyspark.sql.functions import collect_list

df.groupBy('group_column').agg(collect_list('value_column').alias('collected_values'))

Example

This example shows how to use collect_list to group data by a category and collect all related values into a list.

python

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.appName('CollectListExample').getOrCreate()

# Sample data
data = [
    ('A', 1),
    ('A', 2),
    ('B', 3),
    ('B', 4),
    ('B', 3),
    ('C', 5)
]

# Create DataFrame
columns = ['category', 'value']
df = spark.createDataFrame(data, columns)

# Use collect_list to aggregate values by category
result = df.groupBy('category').agg(collect_list('value').alias('values_list'))

result.show()

Output

+--------+-----------+
|category|values_list|
+--------+-----------+
|       C|        [5]|
|       B|  [3, 4, 3]|
|       A|     [1, 2]|
+--------+-----------+

Common Pitfalls

One common mistake is expecting collect_list to remove duplicates; it does not. Use collect_set if you want unique values.

Another pitfall is calling collect_list without groupBy when you wanted per-group lists. The aggregation then runs over the entire DataFrame and returns a single row containing one list of all values.

python

from pyspark.sql.functions import collect_set, collect_list

# Without groupBy: every value in the DataFrame ends up in a single list.
# This is only what you want when you need one global list.
# (show() returns None, so its result is not assigned to a variable.)
df.agg(collect_list('value').alias('all_values')).show()

# With groupBy: one list per category
df.groupBy('category').agg(collect_list('value').alias('values_list')).show()

# To remove duplicates within each group, use collect_set
df.groupBy('category').agg(collect_set('value').alias('unique_values')).show()

Output

+------------------+
|        all_values|
+------------------+
|[1, 2, 3, 4, 3, 5]|
+------------------+

+--------+-----------+
|category|values_list|
+--------+-----------+
|       C|        [5]|
|       B|  [3, 4, 3]|
|       A|     [1, 2]|
+--------+-----------+

+--------+-------------+
|category|unique_values|
+--------+-------------+
|       C|          [5]|
|       B|       [3, 4]|
|       A|       [1, 2]|
+--------+-------------+

Quick Reference

collect_list(column): Collects all values into a list (duplicates included).
collect_set(column): Collects unique values into a list (no duplicates).
Use groupBy before collect_list to get one list per group; without it, the whole DataFrame is collected into a single list.
Use .alias() to name the aggregated column.

Key Takeaways

Use collect_list with groupBy to gather column values into lists per group.

collect_list keeps duplicates; use collect_set to remove duplicates.

Always alias the aggregated column for clear output.

Without groupBy, collect_list aggregates all rows into one list.

collect_list is useful for creating list-type summaries in grouped data.