Prompt Engineering / GenAI · ~20 mins

Bias in generative models in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Bias in generative models
Problem: You have a generative AI model that creates text from prompts. The model tends to produce biased or stereotypical outputs for certain groups, which is unfair and can cause harm.
Current Metrics: The bias score, measured by a fairness metric, is 0.35 (on a scale where 0 means no bias and 1 means high bias). The model generates biased language in 35% of tested samples.
Issue: The model shows significant bias in generated text, producing unfair stereotypes and unbalanced representations.
Your Task
Reduce the bias score from 0.35 to below 0.15 while maintaining the quality of generated text.
You cannot reduce the size of the training data drastically.
You must keep the model architecture the same.
You can only adjust training methods or add bias mitigation techniques.
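Before changing the training, it helps to pin down what the bias score means. One common reading, consistent with the metrics above, is the fraction of generated samples flagged by a bias detector. The sketch below assumes that reading; `contains_biased_language` and the `BIASED_TERMS` list are hypothetical stand-ins for a real fairness classifier.

```python
# Hypothetical sketch of the bias score used in this exercise: the
# fraction of generated samples flagged as biased (0 = none, 1 = all).
# The keyword check is a placeholder for a real fairness classifier.

BIASED_TERMS = {"stereotype1", "stereotype2"}  # placeholder term list

def contains_biased_language(text):
    """Flag a sample if it contains any placeholder biased term."""
    return any(term in text.lower() for term in BIASED_TERMS)

def bias_score(samples):
    """Fraction of samples flagged as biased."""
    if not samples:
        return 0.0
    flagged = sum(contains_biased_language(s) for s in samples)
    return flagged / len(samples)

samples = [
    "a neutral sentence",
    "text with stereotype1",
    "another neutral one",
    "more stereotype2 here",
]
print(bias_score(samples))  # → 0.5
```

With a real detector in place of the keyword check, the same counting logic yields the 0.35 starting score described above.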
Solution
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments

# Load pretrained model and tokenizer
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained(model_name)

# Assume we have a balanced fine-tuning dataset 'balanced_dataset' prepared to reduce bias
# This dataset contains prompts and unbiased target texts

# Define a custom loss function to penalize biased outputs (simplified example)
class BiasMitigationTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(**inputs)
        logits = outputs.logits
        labels = inputs['labels']
        # Shift so tokens < n predict token n (standard causal-LM loss)
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        loss_fct = torch.nn.CrossEntropyLoss()
        base_loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
                             shift_labels.view(-1))
        # Dummy bias penalty: discourage probability mass on biased tokens.
        # 'stereotype1'/'stereotype2' are placeholders for a real term list.
        bias_tokens = [tokenizer.encode(word, add_special_tokens=False)[0]
                       for word in ['stereotype1', 'stereotype2']]
        probs = torch.softmax(logits, dim=-1)
        bias_penalty = sum(probs[:, :, token].mean() for token in bias_tokens)
        total_loss = base_loss + 0.1 * bias_penalty
        return (total_loss, outputs) if return_outputs else total_loss

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy='epoch',  # use eval_strategy in newer transformers versions
    save_strategy='epoch',
    logging_dir='./logs',
    logging_steps=10,
    learning_rate=5e-5
)

trainer = BiasMitigationTrainer(
    model=model,
    args=training_args,
    train_dataset=balanced_dataset,
    eval_dataset=balanced_eval_dataset
)

trainer.train()
Fine-tuned the pretrained generative model on a balanced dataset to reduce bias.
Added a custom loss penalty to discourage biased token generation.
Kept the original model architecture unchanged.
Used training arguments with moderate learning rate and batch size for stable fine-tuning.
Results Interpretation

Before: Bias score = 0.35, biased outputs in 35% of samples.

After: Bias score = 0.12, biased outputs in 12% of samples.

Fine-tuning a generative model on balanced data and adding bias penalties during training can effectively reduce bias without changing the model architecture.
Bonus Experiment
Try using adversarial training where a discriminator detects bias and the generator learns to avoid it.
💡 Hint
Implement a two-model setup where the discriminator guides the generator to produce unbiased text by providing feedback during training.
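A minimal sketch of the discriminator half of that setup, assuming the generator's hidden states are available. `BiasDiscriminator`, the mean-pooling choice, and the 0.1 weight are illustrative assumptions, not part of any library API.

```python
import torch
import torch.nn as nn

class BiasDiscriminator(nn.Module):
    """Scores pooled hidden states: probability that the text is biased."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, hidden_states):
        # Mean-pool over the sequence dimension, then classify.
        pooled = hidden_states.mean(dim=1)
        return torch.sigmoid(self.classifier(pooled))

def generator_adversarial_loss(disc, hidden_states):
    # The generator is rewarded when the discriminator scores its
    # output as unbiased (probability close to 0).
    return disc(hidden_states).mean()

# Usage with toy tensors (batch=2, seq_len=5, GPT-2 hidden size 768):
disc = BiasDiscriminator(hidden_dim=768)
hidden = torch.randn(2, 5, 768)
adv_loss = generator_adversarial_loss(disc, hidden)
# Combined generator objective (weight 0.1 is illustrative):
# total_loss = lm_loss + 0.1 * adv_loss
```

The discriminator itself would be trained in alternation on labeled biased/unbiased text, so its feedback keeps improving as the generator adapts.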