Imagine you have a very big neural network that doesn't fit into the memory of a single GPU. Why does using distributed training across multiple GPUs help in this case?
Think about how splitting work helps when one worker can't carry the whole load.
Distributed training can split the model itself across multiple GPUs (model parallelism), so each GPU holds only part of the parameters and activations. This lowers the memory required per GPU, making it possible to train models too large to fit on any single GPU.
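To make the memory saving concrete, here is a minimal sketch that counts the parameters of an illustrative two-layer model (the layer sizes are hypothetical) and estimates the fp32 parameter memory per GPU if the model were split evenly across two devices:

```python
import torch.nn as nn

# Hypothetical large model: two big linear layers.
model = nn.Sequential(nn.Linear(1000, 10000), nn.Linear(10000, 1000))

# Total parameter count (weights + biases).
total_params = sum(p.numel() for p in model.parameters())

# fp32 parameters take 4 bytes each (before gradients and optimizer state).
bytes_total = total_params * 4

# Splitting the model across 2 GPUs roughly halves per-GPU parameter memory.
bytes_per_gpu = bytes_total / 2
print(total_params, int(bytes_per_gpu))  # 20011000 40022000
```

Note this counts parameters only; in practice gradients, optimizer state, and activations add several multiples of this figure.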
In distributed training, what does model parallelism mean?
Think about dividing the model itself, not the data.
Model parallelism means splitting the model's layers or parts across multiple devices. Each device processes only its part, which helps when the model is too big for one device.
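A minimal model-parallel sketch in PyTorch: each stage is placed on its own device, and activations are moved between devices at the stage boundary. The class name and layer sizes are illustrative; with two GPUs you would pass "cuda:0" and "cuda:1" as the devices.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Model-parallel sketch: each stage lives on its own device."""
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.part1 = nn.Linear(1000, 10000).to(dev0)
        self.part2 = nn.Linear(10000, 1000).to(dev1)

    def forward(self, x):
        x = self.part1(x.to(self.dev0))
        # Activations are copied between devices at the stage boundary.
        x = self.part2(x.to(self.dev1))
        return x

# Fall back to a single device so the sketch runs anywhere.
dev0 = dev1 = "cuda:0" if torch.cuda.is_available() else "cpu"
model = TwoStageModel(dev0, dev1)
out = model(torch.randn(4, 1000))
print(out.shape)  # torch.Size([4, 1000])
```

Each device only allocates the parameters of its own stage, which is the memory saving model parallelism provides.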
What will this PyTorch snippet print on each rank in a distributed setup, where each rank reports the weight shape of one part of the model?
import torch
import torch.nn as nn
import torch.distributed as dist

class BigModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1000, 10000)
        self.part2 = nn.Linear(10000, 1000)

    def forward(self, x):
        x = self.part1(x)
        x = self.part2(x)
        return x

# Assume the distributed environment is already initialized
rank = dist.get_rank()
model = BigModel()
if rank == 0:
    print(f"GPU {rank} model part1 weight shape: {model.part1.weight.shape}")
else:
    print(f"GPU {rank} model part2 weight shape: {model.part2.weight.shape}")
Remember the shape of nn.Linear weight is (out_features, in_features).
In PyTorch, an nn.Linear weight has shape (out_features, in_features). part1 is Linear(1000, 10000), so its weight shape is (10000, 1000); part2 is Linear(10000, 1000), so its weight shape is (1000, 10000). Rank 0 therefore prints torch.Size([10000, 1000]), and every other rank prints torch.Size([1000, 10000]).
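You can verify the shapes without any distributed setup, since they depend only on how nn.Linear stores its weight:

```python
import torch.nn as nn

part1 = nn.Linear(1000, 10000)
part2 = nn.Linear(10000, 1000)

# nn.Linear stores weight as (out_features, in_features).
print(tuple(part1.weight.shape))  # (10000, 1000)
print(tuple(part2.weight.shape))  # (1000, 10000)
```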
When training a large model distributed across multiple GPUs, which batch size strategy helps avoid out-of-memory errors?
Think about how memory is shared or split across GPUs.
Using a smaller batch size per GPU reduces the activation memory each device must hold during the forward and backward passes, helping avoid out-of-memory errors; the effective global batch size is then the per-GPU batch size times the number of GPUs.
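When shrinking the per-GPU batch would make the effective batch too small, gradient accumulation is a common companion technique: process several small micro-batches and step the optimizer once. A minimal sketch (the layer sizes and step count are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

data = torch.randn(8, 10)
target = torch.randn(8, 1)

accum_steps = 4  # 4 micro-batches of 2 stand in for one batch of 8
opt.zero_grad()
for i in range(accum_steps):
    xb = data[i * 2:(i + 1) * 2]
    yb = target[i * 2:(i + 1) * 2]
    # Scale the loss so the accumulated gradients match the full-batch
    # average (assumes equal-size micro-batches and mean reduction).
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()  # gradients accumulate into .grad across iterations
opt.step()  # one optimizer update for the whole effective batch
```

Only one micro-batch's activations are live at a time, so peak memory tracks the micro-batch size, not the effective batch size.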
Which statement best describes the typical effect of distributed training on training speed and model accuracy for very large models?
Think about how parallel work affects speed and correctness.
Distributed training speeds up training by parallelizing work across GPUs. With correct gradient synchronization and data handling, the final model accuracy matches single-GPU training closely; any differences come from floating-point nondeterminism or batch-size effects rather than from the distribution itself.
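A sketch of why properly synchronized data parallelism can match single-GPU results: averaging per-worker gradients (which is what an all-reduce computes) reproduces the full-batch gradient. This CPU-runnable illustration simulates two "workers", each holding half the batch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(5, 1)
loss_fn = nn.MSELoss()
x, y = torch.randn(8, 5), torch.randn(8, 1)

# Full-batch gradient (the single-GPU view).
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Two simulated workers, each with half the batch; average their
# gradients, as all-reduce does in synchronized data parallelism.
grads = []
for shard_x, shard_y in ((x[:4], y[:4]), (x[4:], y[4:])):
    model.zero_grad()
    loss_fn(model(shard_x), shard_y).backward()
    grads.append(model.weight.grad.clone())
avg_grad = (grads[0] + grads[1]) / 2

print(torch.allclose(full_grad, avg_grad, atol=1e-6))  # True
```

Because every worker applies the same averaged gradient, the optimization trajectory matches a single device processing the whole batch.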