Prompt Engineering / GenAI · ML · ~20 mins

Self-hosted LLMs (Llama, Mistral) in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Self-hosted LLMs (Llama, Mistral)
Problem: You want to run a large language model (LLM) such as Llama or Mistral on your own computer or server to generate text without relying on external APIs.
Current Metrics: The model loads successfully but generates repetitive or low-quality text. Response time is slow, and no fine-tuning or prompt optimization has been applied.
Issue: The model degenerates into repetitive output on short prompts, and inference is slow due to inefficient settings. Both quality and speed need improvement.
Your Task
Improve the quality and diversity of generated text while reducing response time when running a self-hosted LLM.
You must keep using the same pre-trained Llama or Mistral model weights.
You cannot use external APIs or cloud services.
You can only change model configuration, prompt design, and inference parameters.
Solution
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = 'meta-llama/Llama-2-7b-chat-hf'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map='auto')

# Set generation parameters to reduce repetition and improve quality
input_text = "Explain how photosynthesis works in simple terms."
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

# Generate with adjusted parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,         # soften the distribution to add diversity
    top_k=50,                # sample only from the 50 most likely tokens
    top_p=0.95,              # nucleus sampling: smallest set with 95% of the mass
    repetition_penalty=1.2,  # down-weight tokens already generated
    num_return_sequences=1
)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
Enabled sampling with temperature=0.8 to add randomness and reduce repetitive text.
Set top_k=50 and top_p=0.95 to focus on more diverse token choices.
Added repetition_penalty=1.2 to discourage repeating the same phrases.
Used torch.float16 and device_map='auto' to speed up inference on GPU.
Kept max_new_tokens reasonable to limit response length.
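To build intuition for what these parameters do, the sketch below reimplements the key transformations in plain Python. This is an illustrative approximation of the sampling pipeline, not the transformers source: temperature scaling of logits, a repetition penalty applied to already-generated tokens, and top-k/top-p filtering of the resulting probabilities.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature=0.8):
    # Lower temperature sharpens the distribution; higher flattens it.
    return [x / temperature for x in logits]

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    # Down-weight tokens that already appeared in the output
    # (positive logits are divided by the penalty, negative ones multiplied).
    out = list(logits)
    for t in set(generated_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    # Keep only the top_k most likely tokens, then truncate further to the
    # smallest set whose cumulative probability reaches top_p; renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order[:top_k]:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

# Toy vocabulary of 4 tokens; token 0 was just generated, so it is penalized.
logits = apply_repetition_penalty([2.0, 1.0, 0.1, -1.0], generated_ids=[0])
probs = top_k_top_p_filter(softmax(apply_temperature(logits)), top_k=3, top_p=0.9)
```

Note how the repetition penalty lowers token 0's probability relative to the unpenalized distribution, which is exactly what discourages the model from looping on the same phrase.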
Results Interpretation

Before: Repetitive, low-quality text with slow response time.
After: More diverse and natural text with faster generation.

Adjusting generation parameters such as temperature, top-k, and repetition penalty can curb repetitive, degenerate output from self-hosted LLMs and improve both quality and speed without changing the model weights.
Bonus Experiment
Try quantizing the model to 8-bit precision to further speed up inference and reduce memory use.
💡 Hint
Use libraries like bitsandbytes with transformers to load the model in 8-bit mode and compare speed and quality.
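One way this might look, as a configuration sketch (assumes a CUDA-capable GPU plus the bitsandbytes and accelerate packages installed; it will not run on a CPU-only machine):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = 'meta-llama/Llama-2-7b-chat-hf'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the same weights in 8-bit via bitsandbytes; this roughly halves
# GPU memory use compared to float16.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map='auto',
)
```

Generate with the same sampling parameters as before and compare wall-clock time and output quality against the float16 model to judge the trade-off.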