GroupBy and aggregations in Apache Spark

GroupBy helps you organize data into groups based on one or more columns. Aggregations let you calculate summary values, such as sums or averages, for each group.
The basic pattern is:

```python
df.groupBy("column_name").agg(functions)
```

Use groupBy to split data into groups by column(s). Use agg to apply aggregation functions such as sum, avg, and count. For example:

```python
df.groupBy("category").sum("sales")
df.groupBy("city").agg({'temperature': 'avg'})
df.groupBy("country").count()
```

The following program creates a Spark DataFrame with sales data by category. It then groups the data by category and calculates the sum, average, and maximum sales for each group. Finally, it prints the results.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg, max

spark = SparkSession.builder.appName("GroupByExample").getOrCreate()

# Sample data
data = [
    ("Electronics", 100),
    ("Electronics", 150),
    ("Clothing", 200),
    ("Clothing", 50),
    ("Groceries", 300)
]

# Create DataFrame
columns = ["category", "sales"]
df = spark.createDataFrame(data, columns)

# Group by category and sum sales
sum_sales = df.groupBy("category").sum("sales")

# Group by category and average sales
avg_sales = df.groupBy("category").agg(avg("sales").alias("avg_sales"))

# Group by category and max sales
max_sales = df.groupBy("category").agg(max("sales").alias("max_sales"))

# Show results
print("Sum of sales by category:")
sum_sales.show()
print("Average sales by category:")
avg_sales.show()
print("Max sales by category:")
max_sales.show()

spark.stop()
```
You can group by multiple columns by passing a list to groupBy, like groupBy(['col1', 'col2']).
Aggregation functions like sum, avg, max, min, and count are available in pyspark.sql.functions.
Always give meaningful names to aggregated columns using alias for clearer output.
GroupBy splits data into groups based on column values.
Aggregations calculate summary statistics for each group.
Use Spark functions like sum, avg, max inside agg for flexible summaries.