How to Use groupBy in PySpark: Syntax and Examples
In PySpark, use groupBy() on a DataFrame to group rows by one or more columns. After grouping, apply aggregation functions such as count(), sum(), or agg() to summarize the grouped data.
Syntax
The groupBy() method groups the DataFrame rows by the specified columns. You can pass one or more column names. After grouping, use aggregation functions to summarize each group.
- df.groupBy('column1'): Groups by one column.
- df.groupBy('col1', 'col2'): Groups by multiple columns.
- Use aggregation functions like count(), sum(), avg(), or agg() after grouping.
```python
df.groupBy('column_name').count()
```
Example
This example shows how to group a DataFrame by the 'category' column and count the number of rows in each group.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('GroupByExample').getOrCreate()

data = [
    ('A', 10),
    ('B', 20),
    ('A', 30),
    ('B', 40),
    ('C', 50)
]
columns = ['category', 'value']
df = spark.createDataFrame(data, columns)

# Group by 'category' and count rows in each group
result = df.groupBy('category').count()
result.show()
```
Output
```
+--------+-----+
|category|count|
+--------+-----+
|       B|    2|
|       C|    1|
|       A|    2|
+--------+-----+
```
Common Pitfalls
Common mistakes when using groupBy() include:
- Forgetting to apply an aggregation function after grouping, which results in a GroupedData object that cannot be displayed directly.
- Passing column names incorrectly, such as using a list instead of separate arguments.
- Using groupBy() without importing the necessary PySpark functions (pyspark.sql.functions) for aggregation.
```python
from pyspark.sql import functions as F

# Wrong: groupBy() without aggregation returns a GroupedData object
# grouped = df.groupBy('category')
# grouped.show()  # AttributeError: GroupedData has no attribute 'show'

# Right: apply an aggregation after groupBy()
result = df.groupBy('category').agg(F.sum('value').alias('total_value'))
result.show()
```
Output
```
+--------+-----------+
|category|total_value|
+--------+-----------+
|       B|         60|
|       C|         50|
|       A|         40|
+--------+-----------+
```
Quick Reference
| Usage | Description |
|---|---|
| df.groupBy('col') | Group rows by one column |
| df.groupBy('col1', 'col2') | Group rows by multiple columns |
| grouped.count() | Count rows in each group |
| grouped.sum('col') | Sum values in each group |
| grouped.agg({'col': 'max'}) | Apply custom aggregation |
Key Takeaways
- Use groupBy() to group DataFrame rows by one or more columns.
- Always apply an aggregation function like count(), sum(), or agg() after groupBy().
- Pass column names as separate arguments, not as a list, to groupBy().
- Import PySpark functions (e.g., from pyspark.sql import functions as F) for advanced aggregations.
- groupBy() returns a GroupedData object that needs aggregation to produce results.