For self-hosted large language models like Llama and Mistral, key metrics include perplexity and accuracy on downstream tasks. Perplexity measures how well the model predicts held-out text: lower values mean the model assigns higher probability to the tokens that actually occur. Accuracy on tasks like question answering or summarization shows real-world usefulness. Together, these metrics tell us whether the model generates sensible, relevant text and performs well on the specific jobs it is deployed for.
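As a rough illustration of the perplexity definition, the sketch below computes it as the exponential of the average negative log-likelihood per token. The probabilities are made-up values, not output from Llama or Mistral, and the function name `perplexity` is chosen here for illustration.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities a model assigned to the tokens that actually came next.
confident = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]
uncertain = [math.log(p) for p in (0.05, 0.05, 0.05, 0.05)]

ppl_confident = perplexity(confident)  # about 2: like choosing among 2 equally likely tokens
ppl_uncertain = perplexity(uncertain)  # about 20: like choosing among 20 equally likely tokens
print(ppl_confident, ppl_uncertain)
```

In practice the log-probabilities would come from running the model over held-out text; the arithmetic stays the same.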
Self-hosted LLMs (Llama, Mistral) in Prompt Engineering / GenAI - Model Metrics & Evaluation
For language models, a confusion matrix is less common. Instead, we use perplexity and task-specific accuracy. For example, on a classification task, a confusion matrix might look like this:
| | Predicted Positive | Predicted Negative |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
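From this, we calculate precision (TP / (TP + FP)), recall (TP / (TP + FN)), and F1 score (the harmonic mean of precision and recall) to understand which kinds of errors the model makes.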
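A minimal sketch of these calculations from the four confusion-matrix cells follows. The counts are invented for illustration, and the helper name `classification_metrics` is an assumption, not part of any library.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts from evaluating an LLM-based classifier on 200 examples.
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(metrics)
```

The zero-division guards return 0.0 when a denominator is empty, which is one common convention; some tools report the value as undefined instead.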
When using self-hosted LLMs for tasks like spam detection or content moderation, precision and recall tradeoffs matter:
- High Precision: The model rarely marks good content as spam. Useful when false alarms are costly.
- High Recall: The model catches most spam, even if some good content is flagged. Important when missing spam is risky.
Choosing which to prioritize depends on the use case. For example, in medical text analysis, high recall is usually the priority, because missing an important finding tends to cost more than reviewing a false alarm.
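One way to see the tradeoff is to vary the decision threshold on a spam score. The sketch below assumes the model emits a score between 0 and 1 per message; the scores, labels, and thresholds are made up for illustration.

```python
def precision_recall_at(scores, labels, threshold):
    """Return (precision, recall) when scores >= threshold are flagged as spam."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical spam scores and ground-truth labels (True = spam).
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, True, False, False]

strict = precision_recall_at(scores, labels, threshold=0.85)   # few flags, no false alarms
lenient = precision_recall_at(scores, labels, threshold=0.35)  # many flags, no missed spam
print("strict:", strict)
print("lenient:", lenient)
```

With these numbers the strict threshold gives perfect precision but catches only half the spam, while the lenient one catches all spam at the cost of flagging some good content.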
Good metrics:
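- Low perplexity (e.g., below 20) on held-out text, indicating the model predicts language well. Perplexity depends on the tokenizer and dataset, so compare values only between runs that share both.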
- High accuracy (above 85%) on specific tasks like classification or summarization.
- Balanced precision and recall (both above 80%) for classification tasks.
Bad metrics:
- High perplexity (above 50), meaning the model struggles to predict text.
- Low accuracy (below 60%) on tasks, showing poor performance.
- Very low recall (below 50%) causing missed important cases.
Common pitfalls:
- Accuracy paradox: High accuracy can be misleading if the data is imbalanced (e.g., a model that always predicts the majority class still scores high).
- Data leakage: Using test data during training inflates metrics falsely.
- Overfitting: Model performs well on training but poorly on new data, hiding true performance.
- Ignoring task-specific metrics: Using only perplexity without checking real task results can miss issues.
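As a small sketch of guarding against data leakage, the check below looks for test examples that also appear in the training set after light normalization. It only catches near-exact duplicates; the helper names and sample texts are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())

def find_leaked_examples(train_texts, test_texts):
    """Return test examples whose normalized form also occurs in the training set."""
    train_seen = {normalize(t) for t in train_texts}
    return [t for t in test_texts if normalize(t) in train_seen]

# Hypothetical training and test examples.
train = ["Win a FREE prize now", "Meeting moved to 3pm"]
test = ["win a free   prize now", "Lunch tomorrow?"]

leaked = find_leaked_examples(train, test)
print(leaked)
```

Paraphrased or partially overlapping examples would need a fuzzier comparison, which this sketch does not attempt.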
Your self-hosted LLM has 98% accuracy on a classification task but only 12% recall on the important class. Is it good for production? Why or why not?
Answer: No, it is not good. The low recall means the model misses most important cases, which can be critical depending on the task. High accuracy alone is misleading if the model ignores the key class.
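To make the answer concrete, here is one set of counts that produces exactly 98% accuracy and 12% recall. The dataset size and class balance are assumptions picked for illustration.

```python
# Hypothetical imbalanced dataset: 10,000 examples, only 200 in the important class.
total = 10_000
positives = 200

tp, fn = 24, 176    # the model finds 24 of the 200 important cases
fp, tn = 24, 9_776  # and handles almost all of the 9,800 negatives correctly

accuracy = (tp + tn) / total
recall = tp / (tp + fn)

# A trivial baseline that always predicts "negative" on the same data.
baseline_accuracy = (total - positives) / total

print(f"accuracy={accuracy:.2f} recall={recall:.2f} baseline={baseline_accuracy:.2f}")
```

With these counts, the model's accuracy is no better than a baseline that never predicts the important class at all, which is why accuracy alone says little here.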
Practice
Solution
Step 1: Understand self-hosted LLMs purpose
Self-hosted LLMs run on your own machines, so your data stays private and under your control.
Step 2: Compare options
Cloud models may send data externally; self-hosted models avoid this, ensuring privacy.
Final Answer:
You keep full control and privacy over your data -> Option A
Quick Check:
Privacy and control = B [OK]
- Thinking self-hosted models are always faster
- Assuming no setup is needed
- Confusing self-hosted with cloud services
Solution
Step 1: Identify correct library and class
The Hugging Face Transformers library uses LlamaForCausalLM to load Llama models.
Step 2: Check method to load model
from_pretrained is the standard method to load pretrained models in Transformers.
Final Answer:
from transformers import LlamaForCausalLM; model = LlamaForCausalLM.from_pretrained('llama-model') -> Option A
Quick Check:
Transformers + from_pretrained = C [OK]
- Using wrong import names
- Calling non-existent load methods
- Confusing Mistral and Llama classes
output?
from transformers import AutoTokenizer, MistralForCausalLM
model = MistralForCausalLM.from_pretrained('mistral-base')
tokenizer = AutoTokenizer.from_pretrained('mistral-base')  # resolves the tokenizer class for the checkpoint
inputs = tokenizer('Hello world', return_tensors='pt')
outputs = model.generate(**inputs)  # tensor of generated token IDs
output = tokenizer.decode(outputs[0])  # token IDs -> string
Solution
Step 1: Understand model.generate output
model.generate returns token IDs as tensors representing generated text tokens.
Step 2: Decode tokens to string
tokenizer.decode converts token IDs to a readable string.
Final Answer:
A decoded string of generated text -> Option D
Quick Check:
generate + decode = string output [OK]
- Thinking output is raw tensor
- Confusing probabilities with tokens
- Assuming generate method is missing
from transformers import LlamaForCausalLM
model = LlamaForCausalLM.load('llama-model')
What is the likely cause of the error?
Solution
Step 1: Check method names in Transformers
Transformers models use from_pretrained() to load models, not load().
Step 2: Identify error cause
Using load() causes AttributeError because it is not defined for LlamaForCausalLM.
Final Answer:
The method load() does not exist; should use from_pretrained() -> Option C
Quick Check:
Use from_pretrained, not load [OK]
- Assuming load() is valid method
- Blaming model name without checking method
- Confusing Llama and Mistral imports
Solution
Step 1: Understand memory constraints
Limited RAM means loading full 32-bit models is heavy and slow.
Step 2: Apply quantization
Quantization reduces model size by using lower precision (e.g., 8-bit), saving memory and keeping decent speed.
Step 3: Evaluate other options
Loading full model wastes memory; CPU without batching is slow; cloud is not self-hosted.
Final Answer:
Use quantization to reduce model size and load with 8-bit precision -> Option B
Quick Check:
Quantization saves memory and keeps performance [OK]
- Loading full 32-bit model ignoring RAM limits
- Running without batching causing slow speed
- Switching to cloud defeats self-hosting purpose
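The memory argument behind quantization can be checked with back-of-the-envelope arithmetic. The sketch below estimates the memory for the weights only (activations and the KV cache add more), assuming a 7-billion-parameter model as an illustrative size.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, precision):
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# Assumed model size for illustration: 7 billion parameters.
NUM_PARAMS = 7e9
estimates = {p: weight_memory_gb(NUM_PARAMS, p) for p in BYTES_PER_PARAM}

for precision, gb in estimates.items():
    print(f"{precision}: ~{gb:.1f} GB")
```

Under these assumptions, 8-bit weights need roughly a quarter of the memory of 32-bit weights, which is the saving the solution relies on.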
