Experiment - Why distributed training handles large models
Problem: Training a large neural network on a single GPU runs out of memory and is slow.
Current Metrics: Training halts early with a CUDA out-of-memory error; no meaningful accuracy is achieved.
Issue: The model is too large to fit in a single GPU's memory, so training fails before it can make progress.
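To make the capacity mismatch concrete, here is a minimal back-of-the-envelope sketch. The formula (weights + gradients + Adam moment buffers, activations excluded) is a standard rough lower bound; the 7-billion-parameter count and the 80 GB GPU figure are illustrative assumptions, not numbers from this experiment:

```python
def training_memory_gb(num_params: float,
                       weight_bytes: int = 4,   # fp32 weights
                       grad_bytes: int = 4,     # fp32 gradients
                       optim_bytes: int = 8):   # Adam: two fp32 moment buffers
    """Rough lower bound on training memory in GiB (activations excluded)."""
    total_bytes = num_params * (weight_bytes + grad_bytes + optim_bytes)
    return total_bytes / 1024**3

# Hypothetical example: a 7B-parameter model trained in fp32 with Adam.
needed = training_memory_gb(7e9)
print(f"~{needed:.0f} GiB needed vs. 80 GiB on one high-end GPU")
```

Even before counting activations, such a model needs on the order of 100 GiB, which is why distributed strategies that shard weights, gradients, and optimizer state across GPUs (or pipeline layers across devices) are required rather than optional.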