Apache Spark · How-To · Beginner · 3 min read

How to Use groupByKey in PySpark: Syntax and Example

In PySpark, groupByKey() groups data by key in an RDD of key-value pairs, returning a new RDD that pairs each key with an iterable of all its values. It is called as rdd.groupByKey(), where rdd contains (key, value) tuples.

Syntax

The groupByKey() function is called on an RDD of key-value pairs. It groups all values with the same key into a single sequence.

  • rdd: An RDD containing tuples of (key, value).
  • groupByKey(): Groups values by their keys.
  • Returns a new RDD of (key, iterable of values).
```python
grouped_rdd = rdd.groupByKey()
```

Example

This example shows how to create an RDD of key-value pairs, use groupByKey() to group values by keys, and collect the results.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('GroupByKeyExample').getOrCreate()
sc = spark.sparkContext

# Create an RDD of key-value pairs
rdd = sc.parallelize([('a', 1), ('b', 2), ('a', 3), ('b', 4), ('c', 5)])

# Group values by key
grouped_rdd = rdd.groupByKey()

# Collect and print results
result = [(key, list(values)) for key, values in grouped_rdd.collect()]
print(result)

spark.stop()
```
Output

```
[('a', [1, 3]), ('b', [2, 4]), ('c', [5])]
```

Common Pitfalls

1. Using groupByKey can cause performance issues: It shuffles all values for each key across the network, which can be slow and memory-heavy.

2. Prefer reduceByKey or aggregateByKey when possible: These combine values locally before shuffling, improving efficiency.

3. groupByKey returns an iterable, not a list: Each key's values arrive as a ResultIterable; convert them to a list (for example with mapValues(list)) before indexing or printing.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('CommonPitfalls').getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([('a', 1), ('a', 2), ('b', 3)])

# Inefficient way: groupByKey
grouped = rdd.groupByKey()
print([(k, list(v)) for k, v in grouped.collect()])

# Better way: reduceByKey to sum values
reduced = rdd.reduceByKey(lambda x, y: x + y)
print(reduced.collect())

spark.stop()
```
Output

```
[('a', [1, 2]), ('b', [3])]
[('a', 3), ('b', 3)]
```

Quick Reference

| Function | Description | When to Use |
| --- | --- | --- |
| groupByKey() | Groups all values by key into an iterable | When you need all values per key together |
| reduceByKey(func) | Combines values by key using func before the shuffle | When you want to aggregate values efficiently |
| aggregateByKey(zeroValue, seqFunc, combFunc) | More flexible aggregation by key | For complex aggregations with initial values |

Key Takeaways

  • Use groupByKey() to group values by key in an RDD of key-value pairs.
  • groupByKey returns an iterable of values per key; convert it to a list if needed.
  • Avoid groupByKey for large datasets; prefer reduceByKey or aggregateByKey for better performance.
  • groupByKey causes a full shuffle of data, which can be slow and memory-intensive.
  • Always test your code on small data before scaling to large datasets.