Practice

1. What is the main purpose of distributed training in machine learning?

easy

A. To avoid using GPUs during training

B. To split the training workload across multiple machines or GPUs

C. To increase the learning rate automatically

D. To reduce the size of the training dataset

Solution

Step 1: Understand distributed training goal
Distributed training is designed to share the training task among several machines or GPUs to speed up the process.
Step 2: Analyze options
Only option B, "To split the training workload across multiple machines or GPUs", describes this purpose. Options A, C, and D do not relate to workload distribution.
Final Answer:
To split the training workload across multiple machines or GPUs -> Option B
Quick Check:
Distributed training = workload split [OK]

Hint: Distributed training means sharing work across machines [OK]

Common Mistakes:

Thinking distributed training reduces dataset size
Confusing distributed training with hyperparameter tuning
Believing distributed training avoids GPU use
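
To make "splitting the workload" concrete, here is a minimal sketch in plain Python. It assumes a strided assignment of batches to workers; the helper name shard_for_rank is hypothetical and is not part of any library. Real frameworks also handle shuffling and uneven shard sizes, which this sketch ignores.

```python
def shard_for_rank(batches, rank, world_size):
    """Return the batches that the worker with this rank should process.

    Hypothetical helper: worker `rank` takes every `world_size`-th batch,
    starting at index `rank`, so no batch is processed twice.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return batches[rank::world_size]


if __name__ == "__main__":
    batches = list(range(10))  # stand-ins for 10 training batches
    for rank in range(4):
        print(rank, shard_for_rank(batches, rank, world_size=4))
```

Each worker sees roughly a quarter of the batches, and together the four shards cover the full dataset. The dataset itself does not shrink; only the share handled by each worker does.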

2. Which of the following is the correct way to initialize a process group for distributed training in PyTorch?

easy

A. torch.distributed.init_process_group(backend='nccl', rank=0, world_size=1)

B. torch.init_process_group(backend='nccl', rank=0, world_size=1)

C. torch.distributed.start_process_group(backend='nccl', rank=0, world_size=1)

D. torch.distributed.init_group(backend='nccl', rank=0, world_size=1)

Solution

Step 1: Identify correct function name
The correct function to initialize communication is torch.distributed.init_process_group.
Step 2: Check syntax correctness
torch.distributed.init_process_group(backend='nccl', rank=0, world_size=1) uses the correct module and function name with proper parameters. Other options use incorrect function names or modules.
Final Answer:
torch.distributed.init_process_group(backend='nccl', rank=0, world_size=1) -> Option A
Quick Check:
Correct init function = torch.distributed.init_process_group(backend='nccl', rank=0, world_size=1) [OK]

Hint: Use torch.distributed.init_process_group to start communication [OK]

Common Mistakes:

Using wrong function names like start_process_group
Calling init_process_group from wrong module
Misspelling function or module names

3. Given the following code snippet for distributed training setup, what is the output of print(rank, world_size)?

import torch.distributed as dist
rank = 2
world_size = 4
print(rank, world_size)

medium

A. 4 2

B. Error: rank and world_size undefined

C. 0 1

D. 2 4

Solution

Step 1: Analyze variable assignments
Variables rank and world_size are assigned values 2 and 4 respectively before the print statement.
Step 2: Understand print output
Printing rank and world_size will output '2 4' exactly as assigned.
Final Answer:
2 4 -> Option D
Quick Check:
Print rank, world_size = 2 4 [OK]

Hint: Print variables as assigned to see output [OK]

Common Mistakes:

Confusing rank with world_size order
Assuming variables are undefined
Expecting automatic values without assignment

4. You wrote this code to initialize distributed training but get an error:

import torch.distributed as dist
dist.init_process_group(backend='nccl', rank=0)

What is missing that causes the error?

medium

A. The rank parameter should be a string

B. The backend parameter is incorrect

C. The world_size parameter is missing

D. The import statement is wrong

Solution

Step 1: Check init_process_group parameters
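When rank is passed explicitly, the function also needs world_size so that each process knows the total number of processes. The only fallback is the default env:// initialization, which reads a missing value from the WORLD_SIZE environment variable; this question assumes that variable is not set.
Step 2: Identify missing parameter
The code omits world_size, which causes the error.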
Final Answer:
The world_size parameter is missing -> Option C
Quick Check:
Missing world_size causes error [OK]

Hint: Always provide world_size with rank in init_process_group [OK]

Common Mistakes:

Omitting world_size parameter
Using wrong backend names
Passing rank as string instead of int
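
The rule behind this answer can be modeled in a few lines of plain Python. This is a simplified sketch, not PyTorch's implementation: the helper resolve_rank_and_world_size is hypothetical, and the exception type it raises is an assumption made for illustration. It mirrors two things that hold for init_process_group: rank and world_size default to -1, and under env:// initialization missing values are read from the RANK and WORLD_SIZE environment variables.

```python
import os


def resolve_rank_and_world_size(rank=-1, world_size=-1, env=None):
    """Hypothetical helper: fill in missing values from environment variables.

    -1 means "not passed", mirroring the defaults of init_process_group.
    """
    env = os.environ if env is None else env
    resolved = {}
    for name, value in (("RANK", rank), ("WORLD_SIZE", world_size)):
        if value == -1:
            if name not in env:
                # The error type here is an assumption, not PyTorch's.
                raise ValueError(f"{name} was not passed and is not set in the environment")
            value = int(env[name])
        resolved[name] = value
    return resolved["RANK"], resolved["WORLD_SIZE"]


if __name__ == "__main__":
    # rank passed explicitly, world_size supplied by the environment
    print(resolve_rank_and_world_size(rank=0, env={"WORLD_SIZE": "4"}))
```

Calling the helper with rank=0 and an empty environment fails, which is the situation in the question: one value is given, and nothing supplies the other.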

5. In a distributed training setup with 4 GPUs, you want each process to know its rank and the total number of processes. Which code snippet correctly sets this up and prints the rank and world size?

hard

A.
import torch.distributed as dist
dist.init_process_group(backend='nccl')
rank = dist.get_rank()
world_size = dist.get_world_size()
print(rank, world_size)

B.
import torch.distributed as dist
world_size = 4
rank = dist.get_rank()
dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
print(rank, world_size)

C.
import torch.distributed as dist
rank = 0
world_size = 4
dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
print(rank, world_size)

D.
import torch.distributed as dist
rank = dist.get_rank()
world_size = dist.get_world_size()
dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
print(rank, world_size)

Solution

Step 1: Understand correct initialization order
dist.init_process_group must be called before dist.get_rank() or dist.get_world_size(), because both functions query the process group that it creates.
Step 2: Analyze each option
Option A initializes the process group first, then reads the rank and world size from it, then prints them. Called without rank and world_size, init_process_group takes both from the RANK and WORLD_SIZE environment variables, which a launcher such as torchrun sets for each process. Options B and D call dist.get_rank() before initialization. Option C initializes the group but hard-codes rank = 0, so every process would claim the same rank.
Final Answer:
Initialize the process group, then call dist.get_rank() and dist.get_world_size() -> Option A
Quick Check:
Init first, then get rank/world_size = Option A [OK]

Hint: Initialize before getting rank and world size [OK]

Common Mistakes:

Calling get_rank before init_process_group
Hard-coding the same rank for every process
Not calling init_process_group at all
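
Option A depends on something outside the snippet: a launcher that tells each process who it is. The sketch below imitates that step with the standard library only. It assumes the RANK and WORLD_SIZE variable names that torchrun exports; the function launch_workers is hypothetical, and it runs the workers one after another for simplicity, whereas a real launcher starts them concurrently.

```python
import os
import subprocess
import sys

# Child program: reads the variables a launcher would export for it.
WORKER_CODE = "import os; print(os.environ['RANK'], os.environ['WORLD_SIZE'])"


def launch_workers(world_size):
    """Run one child process per rank and collect what each one prints."""
    outputs = []
    for rank in range(world_size):
        # Each child gets its own RANK but the same WORLD_SIZE.
        env = dict(os.environ, RANK=str(rank), WORLD_SIZE=str(world_size))
        result = subprocess.run(
            [sys.executable, "-c", WORKER_CODE],
            env=env,
            capture_output=True,
            text=True,
            check=True,
        )
        outputs.append(result.stdout.strip())
    return outputs


if __name__ == "__main__":
    for line in launch_workers(4):
        print(line)
```

The same worker code produces a different rank in each process because the launcher, not the script, supplies the value. Hard-coding rank = 0, as in option C, loses that distinction.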

