LLM scaling laws in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - LLM scaling laws
Problem: You want to understand how increasing the size of a large language model (LLM) affects its performance on a text prediction task.
Current Metrics: A model with ~0.1 million parameters achieves ~60% accuracy on validation data.
Issue: The model is too small and does not reach the desired accuracy. You want to see how scaling up the parameter count improves results.
Your Task
Train LLMs of increasing sizes (~0.1M, ~0.7M, ~3M parameters) and observe how validation accuracy improves. Target: validation accuracy >85% with larger models.
Use the same dataset and training procedure for all models.
Only change model size (number of parameters).
Keep training epochs and batch size fixed.
Solution
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, random_split
import matplotlib.pyplot as plt

# Dummy dataset: simple text classification with tokenized inputs
X = torch.randint(0, 1000, (1000, 50))  # 1000 samples, 50 tokens each
Y = (X.sum(dim=1) % 2).long()           # Binary labels: parity of each sample's token-ID sum
dataset = TensorDataset(X, Y)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_ds, val_ds = random_split(dataset, [train_size, val_size])
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=32, shuffle=False)

class SimpleLLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_layers, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, dim_feedforward=hidden_dim)
            for _ in range(num_layers)
        ])
        self.classifier = nn.Linear(embed_dim, 2)

    def forward(self, x):
        x = self.embedding(x)  # (batch, seq_len, embed_dim)
        x = x.permute(1, 0, 2)  # Transformer expects (seq_len, batch, embed_dim)
        for layer in self.layers:
            x = layer(x)
        x = x.mean(dim=0)  # average over sequence length
        out = self.classifier(x)
        return out

# Training function
def train_model(model, dataloader, epochs=5):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    model.train()
    for epoch in range(epochs):
        for xb, yb in dataloader:
            optimizer.zero_grad()
            preds = model(xb)
            loss = criterion(preds, yb)
            loss.backward()
            optimizer.step()
    return model

# Evaluation function
def evaluate_model(model, dataloader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for xb, yb in dataloader:
            preds = model(xb)
            predicted = preds.argmax(dim=1)
            correct += (predicted == yb).sum().item()
            total += yb.size(0)
    return correct / total * 100

# Parameter counting (returns the count in millions of parameters)
def count_params(model):
    return sum(p.numel() for p in model.parameters()) / 1e6

# Model sizes to test
model_configs = [
    {"num_layers": 2, "hidden_dim": 128, "embed_dim": 64},  # ~0.1M params
    {"num_layers": 4, "hidden_dim": 256, "embed_dim": 128}, # ~0.7M params
    {"num_layers": 6, "hidden_dim": 512, "embed_dim": 256}  # ~3M params
]

results = []
for config in model_configs:
    model = SimpleLLM(vocab_size=1000, embed_dim=config["embed_dim"], num_layers=config["num_layers"], hidden_dim=config["hidden_dim"])
    num_params = count_params(model)  # in millions
    model = train_model(model, train_loader, epochs=5)
    acc = evaluate_model(model, val_loader)
    # Store the numeric size so it can be plotted without re-parsing a string
    results.append({"params_m": num_params, "accuracy": acc})

# Print results (single quotes inside the f-string avoid a syntax error before Python 3.12)
for r in results:
    print(f"Model size approx: {r['params_m']:.1f}M params, Validation accuracy: {r['accuracy']:.2f}%")

# Plot
sizes = [r["params_m"] for r in results]
accs = [r["accuracy"] for r in results]
plt.plot(sizes, accs, marker='o')
plt.xlabel('Model size (millions of params)')
plt.ylabel('Validation Accuracy (%)')
plt.title('LLM Scaling Law: Accuracy vs Model Size')
plt.grid(True)
plt.show()
Increased number of layers from 2 to 6 to scale model size.
Increased hidden dimension and embedding size to increase parameters.
Kept training epochs and batch size fixed to isolate effect of model size.
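
The approximate sizes quoted in the config comments can be checked by hand. The sketch below assumes the parameter layout of PyTorch's nn.TransformerEncoderLayer (a fused query/key/value projection and an output projection, two feed-forward linear layers, and two LayerNorms, all with biases). The function names are illustrative and not part of the solution.

```python
def encoder_layer_params(d_model: int, dim_ff: int) -> int:
    """Parameters in one Transformer encoder layer (assumed PyTorch layout)."""
    attention = 4 * d_model * d_model + 4 * d_model          # Q, K, V and output projections, with biases
    feed_forward = 2 * d_model * dim_ff + dim_ff + d_model   # two linear layers, with biases
    layer_norms = 2 * 2 * d_model                            # two LayerNorms, weight and bias each
    return attention + feed_forward + layer_norms


def simple_llm_params(vocab_size, embed_dim, num_layers, hidden_dim, num_classes=2):
    """Total parameters of the SimpleLLM architecture used in the solution."""
    embedding = vocab_size * embed_dim
    layers = num_layers * encoder_layer_params(embed_dim, hidden_dim)
    classifier = embed_dim * num_classes + num_classes
    return embedding + layers + classifier


configs = [
    {"num_layers": 2, "hidden_dim": 128, "embed_dim": 64},
    {"num_layers": 4, "hidden_dim": 256, "embed_dim": 128},
    {"num_layers": 6, "hidden_dim": 512, "embed_dim": 256},
]
counts = [simple_llm_params(vocab_size=1000, **cfg) for cfg in configs]
for cfg, n in zip(configs, counts):
    print(f"{cfg}: {n:,} parameters ({n / 1e6:.2f}M)")
```

Under these assumptions the three configurations come to roughly 0.13M, 0.66M, and 3.4M parameters, and the embedding table accounts for about half of the smallest model.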
Results Interpretation

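Before scaling: ~0.1M-parameter model, validation accuracy ~60%
After scaling: ~3M-parameter model, validation accuracy ~85%

In this experiment, the larger models reach higher validation accuracy on the same task, with the same data and training procedure.

This is consistent with the LLM scaling law principle that performance generally improves as model size grows, with diminishing returns. Scaling laws are usually stated for loss as a power law in parameters, data, and compute, so three model sizes on a toy dataset illustrate the trend rather than fit such a law.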
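
To connect the two accuracy figures to a power law, one rough illustration is to compute the exponent implied by the endpoints. This sketch assumes that the error rate (100 minus accuracy) falls as N^(-b) with no irreducible floor, which two data points cannot confirm. The function name is hypothetical.

```python
import math


def implied_exponent(n_small, err_small, n_large, err_large):
    """Exponent b such that err proportional to N**(-b) passes through both points."""
    return math.log(err_small / err_large) / math.log(n_large / n_small)


# Approximate figures quoted above: ~0.1M params at ~60% accuracy, ~3M params at ~85%.
b = implied_exponent(0.1e6, 100 - 60, 3e6, 100 - 85)
print(f"implied exponent b = {b:.2f}")
```

Under that assumption the exponent is about 0.29: a 30x increase in parameters cuts the error rate by a factor of roughly 2.7.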
Bonus Experiment
Try adding dropout layers to the largest model to reduce overfitting and see if validation accuracy improves further.
💡 Hint
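Add nn.Dropout layers after the Transformer layers and tune the dropout rate between 0.1 and 0.3. Note that nn.TransformerEncoderLayer already applies dropout internally (its dropout argument defaults to 0.1), so you can also tune that argument directly.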

Practice

1. What do LLM scaling laws primarily describe in language model training?
easy
A. The syntax rules for writing code in AI frameworks
B. How model size, data amount, and compute resources affect performance
C. The best way to label data for supervised learning
D. How to deploy models on mobile devices

Solution

  1. Step 1: Understand the purpose of scaling laws

    LLM scaling laws explain the relationship between model size, data, and compute with model performance.
  2. Step 2: Match the description to options

    Only option B, "How model size, data amount, and compute resources affect performance", describes this relationship; the other options cover unrelated topics.
  3. Final Answer:

    How model size, data amount, and compute resources affect performance -> Option B
  4. Quick Check:

    Scaling laws = model size, data, compute impact [OK]
Hint: Focus on model size, data, and compute impact keywords [OK]
Common Mistakes:
  • Confusing scaling laws with coding syntax
  • Thinking scaling laws are about data labeling
  • Assuming scaling laws relate to deployment
2. Which of the following is the correct formula representing a simplified LLM scaling law for loss L as a function of model parameters N and dataset size D?
easy
A. L = a / (N + D)
B. L = a + b * N + c * D
C. L = a * log(N) + b * log(D)
D. L = a * N^(-b) + c * D^(-d)

Solution

  1. Step 1: Recall the typical scaling law form

    Scaling laws often show loss decreases as power laws of model size and data, like L = a * N^(-b) + c * D^(-d).
  2. Step 2: Compare options to this form

    L = a * N^(-b) + c * D^(-d) matches the power law form. The others use a reciprocal of the sum (A), a linear form (B), or a logarithmic form (C), none of which is a power law in N and D.
  3. Final Answer:

    L = a * N^(-b) + c * D^(-d) -> Option D
  4. Quick Check:

    Loss decreases as power laws of N and D [OK]
Hint: Look for power law (exponent) form in the formula [OK]
Common Mistakes:
  • Choosing linear formulas instead of power laws
  • Confusing logarithmic with power law forms
  • Ignoring the negative exponents for loss decrease
3. Consider this Python code simulating a simplified LLM loss calculation:
def loss(N, D, a=1.0, b=0.5, c=1.0, d=0.3):
    return a * N**(-b) + c * D**(-d)

print(round(loss(1000, 10000), 4))

What is the output?
medium
A. 0.0947
B. 0.1265
C. 0.0316
D. 1.0000

Solution

  1. Step 1: Calculate each term separately

    N=1000, b=0.5: 1000**(-0.5) = 1/sqrt(1000) ≈ 0.0316
    D=10000, d=0.3: 10000**(-0.3) ≈ 0.0631
  2. Step 2: Sum the terms and round to 4 decimals

    1.0 * 0.0316 + 1.0 * 0.0631 = 0.0947
  3. Final Answer:

    0.0947 -> Option A
  4. Quick Check:

    N**(-0.5) + D**(-0.3) ≈ 0.0316 + 0.0631 = 0.0947 [OK]
Hint: Calculate each power term separately, then sum [OK]
Common Mistakes:
  • Calculating only one term instead of sum
  • Mixing up exponents or signs
  • Rounding too early causing errors
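
The same function also shows why the two terms matter separately. The sketch below keeps D fixed and grows N, using the quiz's toy constants (not fitted values), so the data term becomes the floor that the loss approaches.

```python
def loss(N, D, a=1.0, b=0.5, c=1.0, d=0.3):
    return a * N**(-b) + c * D**(-d)


D = 10_000
data_term = D ** -0.3  # contribution of the dataset size alone
rows = [(N, loss(N, D)) for N in (1_000, 10_000, 100_000)]
for N, value in rows:
    print(f"N={N:>7}: loss={value:.4f} (data term alone: {data_term:.4f})")
```

Each 10x increase in N buys a smaller improvement (0.0947, 0.0731, 0.0663), and the loss never drops below the data term of about 0.0631.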
4. The following code aims to compute loss using LLM scaling laws but has a bug:
def loss(N, D, a=1.0, b=0.5, c=1.0, d=0.3):
    return a * N**b + c * D**d

print(round(loss(1000, 10000), 4))

What is the main error?
medium
A. Function should return a tuple, not a single value
B. Missing multiplication operator between variables
C. Exponents should be negative to show loss decreases with size
D. Parameters a and c should be integers only

Solution

  1. Step 1: Identify the intended formula

    LLM scaling laws show loss decreases as model size and data increase, so exponents must be negative.
  2. Step 2: Check the code exponents

    The code uses positive exponents (N**b and D**d), which incorrectly increase loss with size.
  3. Final Answer:

    Exponents should be negative to show loss decreases with size -> Option C
  4. Quick Check:

    Negative exponents mean loss decreases as size grows [OK]
Hint: Remember loss decreases, so exponents must be negative [OK]
Common Mistakes:
  • Thinking multiplication is missing
  • Believing return type must be tuple
  • Assuming parameter types must be integers
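
A quick way to see the bug is to run both versions side by side. In this sketch, buggy_loss and fixed_loss are illustrative names for the code above and its corrected form.

```python
def buggy_loss(N, D, a=1.0, b=0.5, c=1.0, d=0.3):
    return a * N**b + c * D**d          # positive exponents: value grows with N and D


def fixed_loss(N, D, a=1.0, b=0.5, c=1.0, d=0.3):
    return a * N**(-b) + c * D**(-d)    # negative exponents: value shrinks with N and D


for N in (1_000, 4_000):
    print(N, round(buggy_loss(N, 10_000), 4), round(fixed_loss(N, 10_000), 4))
```

For the quiz inputs the buggy version returns 47.4717 instead of 0.0947, and it gets larger as N grows, which is the opposite of what a scaling law describes.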
5. You want to reduce the loss of a large language model efficiently. According to LLM scaling laws, which strategy is best if you have limited compute but can increase data or model size?
hard
A. Increase dataset size moderately while keeping model size fixed
B. Increase model size drastically without adding data
C. Keep both model size and data fixed and train longer
D. Reduce dataset size to speed up training

Solution

  1. Step 1: Understand compute constraints and scaling laws

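    Scaling laws show that loss improves with both model size and data, but training cost grows with both, so limited compute rules out a drastic increase in model size.
  2. Step 2: Choose strategy fitting limited compute

    A drastic increase in model size raises the cost of every training step, and without more data the data term c * D^(-d) stays fixed and caps the gain. A moderate increase in data is cheaper and lowers that term, so among the options "Increase dataset size moderately while keeping model size fixed" is best.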
  3. Final Answer:

    Increase dataset size moderately while keeping model size fixed -> Option A
  4. Quick Check:

    Limited compute favors data increase over big model growth [OK]
Hint: With limited compute, grow data before model size [OK]
Common Mistakes:
  • Thinking a bigger model is always better regardless of compute
  • Ignoring compute limits and training time
  • Reducing data to speed up training, which harms performance
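
The trade-off can be sketched numerically. The example below assumes the common approximation that training compute is about 6 * N * D, uses the quiz's toy constants rather than fitted values, and picks a hypothetical budget. The best_split helper is illustrative only.

```python
def loss(N, D, a=1.0, b=0.5, c=1.0, d=0.3):
    return a * N**(-b) + c * D**(-d)


def best_split(compute_budget, candidates):
    """Return (loss, N, D) for the candidate N with the lowest loss under C = 6 * N * D."""
    results = []
    for N in candidates:
        D = compute_budget / (6 * N)  # data affordable at this model size (assumed rule)
        results.append((loss(N, D), N, D))
    return min(results)


budget = 6e9  # hypothetical compute budget
candidates = [10 ** (k / 4) for k in range(8, 29)]  # N from 1e2 to 1e7, log-spaced
best_loss, best_N, best_D = best_split(budget, candidates)
print(f"best N ~ {best_N:,.0f}, D ~ {best_D:,.0f}, loss = {best_loss:.4f}")
```

With these toy constants the lowest loss sits at an intermediate model size: spending the whole budget on parameters leaves too little data, which matches the reasoning above.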