Prompt Engineering / GenAI · ML · ~20 mins

Self-hosted LLMs (Llama, Mistral) in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Self-hosted LLMs (Llama, Mistral)
Problem: You want to run a large language model (LLM) such as Llama or Mistral on your own computer or server to generate text without relying on external APIs.
Current Metrics: The model loads successfully but generates repetitive or low-quality text. Response time is slow, and no fine-tuning or prompt optimization has been applied.
Issue: The model degenerates into repetitive output on short prompts, and inference is slow due to inefficient settings. Both quality and speed need improvement.
Your Task
Improve the quality and diversity of generated text while reducing response time when running a self-hosted LLM.
You must keep using the same pre-trained Llama or Mistral model weights.
You cannot use external APIs or cloud services.
You can only change model configuration, prompt design, and inference parameters.
Solution
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = 'meta-llama/Llama-2-7b-chat-hf'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map='auto')

# Set generation parameters to reduce repetition and improve quality
input_text = "Explain how photosynthesis works in simple terms."
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

# Generate with adjusted parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,         # soften the distribution to add diversity
    top_k=50,                # sample only from the 50 most likely tokens
    top_p=0.95,              # nucleus sampling: smallest set with 95% of the mass
    repetition_penalty=1.2,  # down-weight tokens already generated
    num_return_sequences=1
)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
Enabled sampling with temperature=0.8 to add randomness and reduce repetitive text.
Set top_k=50 and top_p=0.95 to focus on more diverse token choices.
Added repetition_penalty=1.2 to discourage repeating the same phrases.
Used torch.float16 and device_map='auto' to speed up inference on GPU.
Kept max_new_tokens reasonable to limit response length.
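To build intuition for what these parameters do, the sketch below reimplements the key transformations in plain Python. This is an illustrative approximation of the sampling pipeline, not the transformers source: temperature scaling of logits, a repetition penalty applied to already-generated tokens, and top-k/top-p filtering of the resulting probabilities.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature=0.8):
    # Lower temperature sharpens the distribution; higher flattens it.
    return [x / temperature for x in logits]

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    # Down-weight tokens that already appeared in the output
    # (positive logits are divided by the penalty, negative ones multiplied).
    out = list(logits)
    for t in set(generated_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    # Keep only the top_k most likely tokens, then truncate further to the
    # smallest set whose cumulative probability reaches top_p; renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order[:top_k]:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

# Toy vocabulary of 4 tokens; token 0 was just generated, so it is penalized.
logits = apply_repetition_penalty([2.0, 1.0, 0.1, -1.0], generated_ids=[0])
probs = top_k_top_p_filter(softmax(apply_temperature(logits)), top_k=3, top_p=0.9)
```

Note how the repetition penalty lowers token 0's probability relative to the unpenalized distribution, which is exactly what discourages the model from looping on the same phrase.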
Results Interpretation

Before: Repetitive, low-quality text with slow response time.
After: More diverse and natural text with faster generation.

Adjusting generation parameters such as temperature, top-k, and repetition penalty can curb repetitive, degenerate output from self-hosted LLMs and improve both quality and speed without changing the model weights.
Bonus Experiment
Try quantizing the model to 8-bit precision to further speed up inference and reduce memory use.
💡 Hint
Use libraries like bitsandbytes with transformers to load the model in 8-bit mode and compare speed and quality.
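One way this might look, as a configuration sketch (assumes a CUDA-capable GPU plus the bitsandbytes and accelerate packages installed; it will not run on a CPU-only machine):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = 'meta-llama/Llama-2-7b-chat-hf'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the same weights in 8-bit via bitsandbytes; this roughly halves
# GPU memory use compared to float16.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map='auto',
)
```

Generate with the same sampling parameters as before and compare wall-clock time and output quality against the float16 model to judge the trade-off.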