Challenge - 5 Problems

GroupBy Performance Master

❓ Predict Output

intermediate

Output of GroupBy with multiple aggregations

What is the output of this code snippet using pandas GroupBy with multiple aggregations?

Pandas

import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C'],
    'Value': [10, 20, 10, 30, 40]
})
result = df.groupby('Category').agg({'Value': ['sum', 'mean']})
print(result)

A.

Category  Value
sum       30
mean      15.0

B.

          Value     
           sum  mean
Category           
A           30  15.0
B           40  20.0
C           40  40.0

C.

Value
sum    100
mean    20.0

D.

Category  Value
A        30
B        40
C        40

❓ Data Output

intermediate

Number of groups created by GroupBy

Given this DataFrame, how many groups will pandas GroupBy create when grouping by 'Type'?

Pandas

import pandas as pd

df = pd.DataFrame({
    'Type': ['X', 'Y', 'X', 'Z', 'Y', 'X'],
    'Score': [5, 10, 15, 20, 25, 30]
})
groups = df.groupby('Type')
print(len(groups))
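
To check a group count without iterating, the GroupBy object can be inspected directly. The sketch below uses a separate, hypothetical DataFrame (the `Color` and `Count` columns are assumptions, not part of the challenge) and shows three ways of counting groups that should agree when the key column has no missing values.

```python
import pandas as pd

# Hypothetical data, deliberately different from the challenge DataFrame.
df = pd.DataFrame({
    'Color': ['red', 'blue', 'red', 'green'],
    'Count': [1, 2, 3, 4],
})
groups = df.groupby('Color')

n_groups = groups.ngroups          # number of groups on the GroupBy object
n_unique = df['Color'].nunique()   # distinct key values in the column
sizes = groups.size().to_dict()    # rows per group, keyed by group label

print(n_groups, n_unique, len(groups))
print(sizes)
```

Each distinct key value forms one group, no matter how many rows share it.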

🔧 Debug

advanced

Identify the cause of slow GroupBy operation

This code runs very slowly on a large DataFrame. What is the main reason for the slow performance?

Pandas

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Category': np.random.choice(['A', 'B', 'C', 'D'], size=10_000_000),
    'Value': np.random.rand(10_000_000)
})

result = df.groupby('Category').apply(lambda x: x['Value'].sum())

A. Using apply with a lambda function instead of built-in aggregation slows down performance.

B. Grouping by a categorical column always causes slowdowns.

C. The DataFrame is too small to benefit from GroupBy optimizations.

D. The random values in 'Value' column cause the slowdown.
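
For comparison, here is a minimal sketch that computes the same per-group sum two ways: with a Python lambda passed to `apply`, and with the built-in `sum` aggregation. The frame is much smaller than the one in the challenge (an assumption, to keep the run short), and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Smaller frame than the challenge so the example runs quickly.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Category': rng.choice(['A', 'B', 'C', 'D'], size=100_000),
    'Value': rng.random(100_000),
})

# A Python-level function is called once per group.
via_apply = df.groupby('Category')['Value'].apply(lambda s: s.sum())

# Built-in aggregation, handled by pandas' own group-wise sum.
via_builtin = df.groupby('Category')['Value'].sum()

# Both approaches should agree up to floating-point rounding.
same_result = np.allclose(via_apply.to_numpy(), via_builtin.to_numpy())
print(same_result)
```

Timing both lines with `time.perf_counter` on your own machine is a simple way to see how the gap changes with the number of rows and groups.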

🧠 Conceptual

advanced

Effect of sorting on GroupBy performance

How does setting the 'sort' parameter to False in pandas GroupBy affect performance and output?

A. Setting sort=False removes duplicate groups, improving performance.

B. Setting sort=False causes an error because sorting is mandatory.

C. Setting sort=False sorts groups alphabetically, slowing down performance.

D. Setting sort=False improves performance by skipping the sort of group keys, so groups appear in the order they first occur in the data.
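
The effect of `sort` on group order can be observed on a small example. This is a sketch on a hypothetical DataFrame (the `Key` and `Value` columns are assumptions) whose keys are deliberately not in alphabetical order.

```python
import pandas as pd

# Hypothetical data: 'b' appears first, then 'a', then 'c'.
df = pd.DataFrame({
    'Key': ['b', 'a', 'c', 'a', 'b'],
    'Value': [1, 2, 3, 4, 5],
})

# Default (sort=True): group keys come back sorted.
sorted_keys = list(df.groupby('Key')['Value'].sum().index)

# sort=False: group keys come back in order of first appearance.
unsorted_keys = list(df.groupby('Key', sort=False)['Value'].sum().index)

print(sorted_keys)    # ['a', 'b', 'c']
print(unsorted_keys)  # ['b', 'a', 'c']
```

The aggregated values are the same either way; only the order of the result index differs.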

🚀 Application

expert

Optimizing memory usage in GroupBy with large categorical data

You have a DataFrame with 50 million rows and a 'Category' column with 100 unique values. Which approach best optimizes memory and performance for grouping?

A. Convert 'Category' to string dtype before grouping.

B. Keep 'Category' as object dtype and group directly.

C. Convert 'Category' to pandas Categorical dtype before grouping.

D. Drop the 'Category' column before grouping to save memory.
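
The memory footprint of a low-cardinality key column can be measured before and after a dtype conversion. The sketch below assumes a much smaller frame than the 50 million rows in the question, and the `cat_0` to `cat_99` labels are hypothetical. It compares `memory_usage(deep=True)` for the column as built from Python strings and as `category`, then groups on the converted column.

```python
import numpy as np
import pandas as pd

# 100 hypothetical labels, sampled into a scaled-down frame.
rng = np.random.default_rng(0)
labels = np.array([f'cat_{i}' for i in range(100)], dtype=object)
df = pd.DataFrame({
    'Category': rng.choice(labels, size=200_000),
    'Value': rng.random(200_000),
})

# Bytes used by the column as originally stored (one string per row).
before_bytes = df['Category'].memory_usage(deep=True)

# Categorical storage: small integer codes plus one copy of each label.
df['Category'] = df['Category'].astype('category')
category_bytes = df['Category'].memory_usage(deep=True)

# observed=True limits the result to categories present in the data.
totals = df.groupby('Category', observed=True)['Value'].sum()

print(before_bytes, category_bytes, len(totals))
```

The exact byte counts depend on the pandas version and string storage in use, so treat the printed numbers as indicative rather than fixed.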
