Aggregation-based features in Data Analysis Python - Step-by-Step Execution

Concept Flow - Aggregation-based features

Start with raw data

↓

Group data by key(s)

↓

Apply aggregation functions

↓

Create new features from aggregated results

↓

Merge aggregated features back to original data

↓

Use enhanced data for analysis or modeling

We start with raw data, group it by one or more keys, apply aggregation functions like sum or mean, create new features from these results, and merge them back to enrich the original data.

Execution Sample

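import pandas as pd

# Raw data: one row per (user, score) observation
data = pd.DataFrame({
    'user': ['A', 'A', 'B', 'B', 'B'],
    'score': [10, 20, 10, 30, 50]
})

# Mean score per user, with 'user' restored as a regular column
agg = data.groupby('user')['score'].mean().reset_index(name='avg_score')
# Attach each user's average to every row belonging to that user
data = data.merge(agg, on='user')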

This code calculates the average score per user and adds it as a new feature to the original data.
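As a cross-check on the sample above, the same feature can also be built without a merge. The sketch below compares the aggregate-then-merge route with `groupby().transform('mean')`, which is an alternative technique and not part of the original sample. The names `merged` and `transformed` are introduced here only for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'user': ['A', 'A', 'B', 'B', 'B'],
    'score': [10, 20, 10, 30, 50]
})

# Route 1 (the sample above): aggregate per user, then merge back.
agg = data.groupby('user')['score'].mean().reset_index(name='avg_score')
merged = data.merge(agg, on='user')

# Route 2 (alternative, not in the original sample): transform broadcasts
# each group's mean back onto the rows of that group.
transformed = data.copy()
transformed['avg_score'] = data.groupby('user')['score'].transform('mean')

print(merged['avg_score'].tolist())       # [15.0, 15.0, 30.0, 30.0, 30.0]
print(transformed['avg_score'].tolist())  # [15.0, 15.0, 30.0, 30.0, 30.0]
```

The merge route is the more general of the two, because the aggregated table can be stored and joined onto other data later. The transform route is shorter when the feature is only needed on the same DataFrame.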

Execution Table

Step	Action	Data State	Result
1	Create initial DataFrame	[{'user':'A','score':10}, {'user':'A','score':20}, {'user':'B','score':10}, {'user':'B','score':30}, {'user':'B','score':50}]	DataFrame with 5 rows
2	Group by 'user'	Groups: 'A' -> [10,20], 'B' -> [10,30,50]	Two groups formed
3	Calculate mean score per user	'A': (10+20)/2=15.0, 'B': (10+30+50)/3=30.0	Aggregation result: [{'user':'A','avg_score':15.0}, {'user':'B','avg_score':30.0}]
4	Merge aggregated avg_score back to original data	Original data + avg_score per user	DataFrame with new column 'avg_score' added
5	Final data	[{'user':'A','score':10,'avg_score':15.0}, {'user':'A','score':20,'avg_score':15.0}, {'user':'B','score':10,'avg_score':30.0}, {'user':'B','score':30,'avg_score':30.0}, {'user':'B','score':50,'avg_score':30.0}]	Data enriched with aggregation-based feature

💡 All rows processed and aggregation feature merged successfully
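To make the rows of the execution table observable, the chained expression can be split into its intermediate objects. This is a minimal sketch. The variables `grouped`, `groups`, `means`, and `result` do not appear in the original sample and are introduced only to expose each step.

```python
import pandas as pd

# Step 1: create the initial DataFrame.
data = pd.DataFrame({
    'user': ['A', 'A', 'B', 'B', 'B'],
    'score': [10, 20, 10, 30, 50]
})

# Step 2: grouping only records which rows belong to which key.
grouped = data.groupby('user')['score']
groups = {user: scores.tolist() for user, scores in grouped}

# Step 3: the mean is a Series indexed by 'user'; reset_index turns it
# into a two-column DataFrame.
means = grouped.mean()
agg = means.reset_index(name='avg_score')

# Step 4: merge on the shared key so every row gets its user's average.
result = data.merge(agg, on='user')

print(groups)                    # {'A': [10, 20], 'B': [10, 30, 50]}
print(agg.to_dict('records'))
print(result.to_dict('records'))
```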

Variable Tracker

Variable	Start	After Step 2	After Step 3	After Step 4	Final
data	Not defined	Original DataFrame with 5 rows	Unchanged (same as step 2)	Merged with avg_score column	DataFrame with 'user', 'score', 'avg_score' columns
agg	Not defined	Not yet assigned (grouping is an intermediate step of the chained expression)	Aggregation result with avg_score per user	Same as step 3	Aggregation DataFrame with 'user' and 'avg_score'

Key Moments - 3 Insights

Why do we need to reset_index() after aggregation?
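One way to see the answer: after `groupby('user')['score'].mean()`, the group keys live in the index of a Series, not in a column. Calling `reset_index(name='avg_score')` moves `user` back into a regular column and names the values, which is the shape the later `merge(..., on='user')` expects. A small sketch, with `means` as an illustrative intermediate name:

```python
import pandas as pd

data = pd.DataFrame({
    'user': ['A', 'A', 'B', 'B', 'B'],
    'score': [10, 20, 10, 30, 50]
})

# Without reset_index(): a Series whose index holds the group keys.
means = data.groupby('user')['score'].mean()
print(type(means).__name__, means.index.name)  # Series user

# With reset_index(name=...): a DataFrame with 'user' as an ordinary column.
agg = means.reset_index(name='avg_score')
print(list(agg.columns))                       # ['user', 'avg_score']
```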

What happens if we merge without specifying 'on' parameter?
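When `on` is omitted, pandas joins on every column name the two DataFrames share. In the sample that is only `user`, so the result is the same. The sketch below also shows a hypothetical pitfall that is not in the original sample: if the aggregated column keeps the name `score`, both `user` and `score` become join keys and most rows are dropped. The names `implicit`, `explicit`, `agg_same_name`, and `risky` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'user': ['A', 'A', 'B', 'B', 'B'],
    'score': [10, 20, 10, 30, 50]
})
agg = data.groupby('user')['score'].mean().reset_index(name='avg_score')

# Only 'user' is shared, so the implicit key is the same as on='user'.
implicit = data.merge(agg)
explicit = data.merge(agg, on='user')
print(implicit.equals(explicit))  # True

# Hypothetical pitfall: the aggregate is still called 'score', so the
# implicit join keys are both 'user' and 'score'. Only rows whose score
# happens to equal the group mean survive the inner join.
agg_same_name = data.groupby('user')['score'].mean().reset_index()
risky = data.merge(agg_same_name)
print(len(data), len(risky))
```

Passing `on='user'` explicitly keeps the join key stable even if column names change later.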

Does aggregation change the original data?
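The execution table suggests the answer is no: `data` is the same after step 3 as after step 2. Both `groupby(...).mean()` and `merge()` return new objects, and `data` only changes in the sample because the merge result is assigned back to it. A short check, using the illustrative names `before` and `enriched`:

```python
import pandas as pd

data = pd.DataFrame({
    'user': ['A', 'A', 'B', 'B', 'B'],
    'score': [10, 20, 10, 30, 50]
})
before = data.copy()

# Aggregation returns a new object and leaves data untouched.
agg = data.groupby('user')['score'].mean().reset_index(name='avg_score')
print(data.equals(before))     # True
print(list(data.columns))      # ['user', 'score']

# merge() also returns a new DataFrame; data keeps its two columns
# until the result is assigned back to it.
enriched = data.merge(agg, on='user')
print(list(enriched.columns))  # ['user', 'score', 'avg_score']
print(list(data.columns))      # ['user', 'score']
```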

Visual Quiz - 3 Questions

Test your understanding

Look at the execution table, what is the average score for user 'B' after step 3?

A. 20

B. 40

C. 30

D. 50

Concept Snapshot

Aggregation-based features:
- Group data by key(s) using groupby()
- Apply aggregation functions (mean, sum, count, etc.)
- Use reset_index() to convert group keys to columns
- Merge aggregated results back to original data
- Result: new features summarizing grouped info
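The snapshot lists several aggregation functions. As a sketch of how more than one can be computed in a single pass, the example below uses pandas named aggregation. The feature names `total_score` and `n_scores` and the `how='left'` argument are choices made for this illustration and do not come from the original sample.

```python
import pandas as pd

data = pd.DataFrame({
    'user': ['A', 'A', 'B', 'B', 'B'],
    'score': [10, 20, 10, 30, 50]
})

# Named aggregation: new_column=(source_column, function).
agg = (
    data.groupby('user')
        .agg(avg_score=('score', 'mean'),
             total_score=('score', 'sum'),
             n_scores=('score', 'count'))
        .reset_index()
)

# how='left' keeps every original row, even if a key were missing in agg.
enriched = data.merge(agg, on='user', how='left')

print(agg.to_dict('records'))
print(list(enriched.columns))
```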

Full Transcript

Aggregation-based features help us summarize data by groups. We start with raw data, group it by one or more keys, then calculate summary statistics like averages. These summaries become new features that describe each group. We use pandas groupby() to group data, then apply aggregation functions like mean(). After aggregation, we call reset_index() so the group keys become regular columns again. Finally, we merge these new features back to the original data so each row carries its group's summary. This process enriches data for better analysis or modeling.
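The whole process described in the transcript can be wrapped in a small function. The helper below, `add_group_feature`, is hypothetical: its name and parameters are assumptions made for this sketch and are not part of pandas or the original lesson.

```python
import pandas as pd

def add_group_feature(df, keys, value, func, name):
    """Hypothetical helper: group df by keys, aggregate the value column
    with func, and merge the result back as a new column called name."""
    agg = df.groupby(keys)[value].agg(func).reset_index(name=name)
    # A left merge keeps every original row in its original order.
    return df.merge(agg, on=keys, how='left')

data = pd.DataFrame({
    'user': ['A', 'A', 'B', 'B', 'B'],
    'score': [10, 20, 10, 30, 50]
})

with_mean = add_group_feature(data, 'user', 'score', 'mean', 'avg_score')
with_max = add_group_feature(data, 'user', 'score', 'max', 'max_score')

print(with_mean['avg_score'].tolist())  # [15.0, 15.0, 30.0, 30.0, 30.0]
print(with_max['max_score'].tolist())   # [20, 20, 50, 50, 50]
```

`keys` can be a single column name or a list of names, since both `groupby` and `merge` accept either form.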