How to Use collect_list in PySpark: Syntax and Examples
In PySpark, use collect_list to gather all values of a column into a list for each group when using groupBy. It is an aggregate function that returns a list of items without removing duplicates.
Syntax
The collect_list function is used as an aggregate function in PySpark. It collects all values from a specified column into a list for each group defined by groupBy.
- groupBy(column): Groups the DataFrame by the specified column.
- agg(collect_list(column)): Aggregates the grouped data by collecting values into a list.
python
from pyspark.sql.functions import collect_list

df.groupBy('group_column').agg(collect_list('value_column').alias('collected_values'))
Example
This example shows how to use collect_list to group data by a category and collect all related values into a list.
python
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.appName('CollectListExample').getOrCreate()

# Sample data
data = [
    ('A', 1),
    ('A', 2),
    ('B', 3),
    ('B', 4),
    ('B', 3),
    ('C', 5)
]

# Create DataFrame
columns = ['category', 'value']
df = spark.createDataFrame(data, columns)

# Use collect_list to aggregate values by category
result = df.groupBy('category').agg(collect_list('value').alias('values_list'))
result.show()
Output
+--------+-----------+
|category|values_list|
+--------+-----------+
|       C|        [5]|
|       B|  [3, 4, 3]|
|       A|     [1, 2]|
+--------+-----------+
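To make the grouping semantics concrete, here is a plain-Python sketch that models what groupBy plus collect_list computes for the sample data. It is not PySpark and not how Spark implements the aggregation, and the helper name collect_list_by_group is hypothetical. One difference to keep in mind: this sketch preserves input order, whereas Spark does not guarantee the order of collected elements after a shuffle.

```python
from collections import defaultdict

def collect_list_by_group(rows):
    """Model of groupBy(key).agg(collect_list(value)) on (key, value) pairs."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)  # duplicates are kept
    return dict(groups)

data = [('A', 1), ('A', 2), ('B', 3), ('B', 4), ('B', 3), ('C', 5)]
result = collect_list_by_group(data)
print(result)  # {'A': [1, 2], 'B': [3, 4, 3], 'C': [5]}
```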
Common Pitfalls
One common mistake is expecting collect_list to remove duplicates; it does not. Use collect_set if you want unique values.
Another pitfall is using collect_list without groupBy, which will collect all values from the entire DataFrame into one list.
python
from pyspark.sql.functions import collect_list, collect_set

# Pitfall: collect_list without groupBy collects all values into one list
# (show() prints the table and returns None, so its result is not assigned)
df.agg(collect_list('value').alias('all_values')).show()

# Right: groupBy before collect_list
df.groupBy('category').agg(collect_list('value').alias('values_list')).show()

# To remove duplicates, use collect_set
df.groupBy('category').agg(collect_set('value').alias('unique_values')).show()
Output
+------------------+
|        all_values|
+------------------+
|[1, 2, 3, 4, 3, 5]|
+------------------+

+--------+-----------+
|category|values_list|
+--------+-----------+
|       C|        [5]|
|       B|  [3, 4, 3]|
|       A|     [1, 2]|
+--------+-----------+

+--------+-------------+
|category|unique_values|
+--------+-------------+
|       C|          [5]|
|       B|       [3, 4]|
|       A|       [1, 2]|
+--------+-------------+
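The three results differ only in whether rows are grouped first and whether duplicates survive. The following plain-Python sketch models the ungrouped and collect_set cases on the same sample data. The helper names collect_all and collect_set_by_group are hypothetical and are not part of the PySpark API; the sorting step is an assumption added only to make the printed result stable.

```python
from collections import defaultdict

data = [('A', 1), ('A', 2), ('B', 3), ('B', 4), ('B', 3), ('C', 5)]

def collect_all(rows):
    """Model of df.agg(collect_list('value')): one list for the whole input."""
    return [value for _, value in rows]

def collect_set_by_group(rows):
    """Model of groupBy(key).agg(collect_set(value)): unique values per key."""
    groups = defaultdict(set)
    for key, value in rows:
        groups[key].add(value)  # a set silently drops repeated values
    # Sorted only for a stable printout; Spark's collect_set does not
    # promise any particular element order.
    return {key: sorted(values) for key, values in groups.items()}

all_values = collect_all(data)
unique_values = collect_set_by_group(data)
print(all_values)     # [1, 2, 3, 4, 3, 5]
print(unique_values)  # {'A': [1, 2], 'B': [3, 4], 'C': [5]}
```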
Quick Reference
- collect_list(column): Collects all values into a list (duplicates included).
- collect_set(column): Collects unique values into a list (no duplicates).
- Always use groupBy before collect_list to aggregate by groups.
- Use .alias() to name the aggregated column.
Key Takeaways
Use collect_list with groupBy to gather column values into lists per group.
collect_list keeps duplicates; use collect_set to remove duplicates.
Always alias the aggregated column for clear output.
Without groupBy, collect_list aggregates all rows into one list.
collect_list is useful for creating list-type summaries in grouped data.