
How to Use groupBy in PySpark: Syntax and Examples

In PySpark, use groupBy() on a DataFrame to group rows by one or more columns. After grouping, apply aggregation functions like count(), sum(), or agg() to summarize the grouped data.
📝

Syntax

The groupBy() method groups the DataFrame rows by specified columns. You can pass one or multiple column names. After grouping, use aggregation functions to summarize each group.

  • df.groupBy('column1'): Groups by one column.
  • df.groupBy('col1', 'col2'): Groups by multiple columns.
  • Use aggregation functions like count(), sum(), avg(), or agg() after grouping.
python
df.groupBy('column_name').count()
💻

Example

This example shows how to group a DataFrame by the 'category' column and count the number of rows in each group.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('GroupByExample').getOrCreate()

data = [
    ('A', 10),
    ('B', 20),
    ('A', 30),
    ('B', 40),
    ('C', 50)
]

columns = ['category', 'value']

df = spark.createDataFrame(data, columns)

# Group by 'category' and count rows
result = df.groupBy('category').count()

result.show()
Output
+--------+-----+
|category|count|
+--------+-----+
|       B|    2|
|       C|    1|
|       A|    2|
+--------+-----+
⚠️

Common Pitfalls

Common mistakes when using groupBy() include:

  • Forgetting to apply an aggregation function after grouping, which results in a GroupedData object that cannot be displayed directly.
  • Referencing a column that does not exist, which raises an AnalysisException when the query runs. (Note that groupBy() itself accepts column names either as separate arguments or as a single list; both forms work.)
  • Using groupBy() without importing necessary PySpark functions for aggregation.
python
from pyspark.sql import functions as F

# Wrong: groupBy without aggregation
# grouped = df.groupBy('category')
# grouped.show()  # AttributeError: GroupedData has no show() method

# Right: apply aggregation after groupBy
result = df.groupBy('category').agg(F.sum('value').alias('total_value'))
result.show()
Output
+--------+-----------+
|category|total_value|
+--------+-----------+
|       B|         60|
|       C|         50|
|       A|         40|
+--------+-----------+
📊

Quick Reference

  • df.groupBy('col'): Group rows by one column
  • df.groupBy('col1', 'col2'): Group rows by multiple columns
  • grouped.count(): Count rows in each group
  • grouped.sum('col'): Sum values in each group
  • grouped.agg({'col': 'max'}): Apply custom aggregation
✅

Key Takeaways

Use groupBy() to group DataFrame rows by one or more columns.
Always apply an aggregation function like count(), sum(), or agg() after groupBy().
Pass column names to groupBy() either as separate arguments or as a single list; both forms are supported.
Import PySpark functions (e.g., from pyspark.sql import functions as F) for advanced aggregations.
groupBy() returns a GroupedData object that needs aggregation to produce results.