Self-hosted LLMs (Llama, Mistral) in Prompt Engineering / GenAI - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Understanding Model Size Impact on Self-hosted LLMs
Which of the following best explains how increasing the number of parameters in a self-hosted LLM like Llama or Mistral affects its performance and resource requirements?
A. Increasing parameters reduces accuracy but speeds up inference and lowers memory use.
B. Increasing parameters decreases both accuracy and resource requirements.
C. Increasing parameters has no effect on accuracy but increases training time only.
D. Increasing parameters generally improves model accuracy but requires more memory and slows inference.
💡 Hint
Think about how bigger models usually behave in terms of accuracy and hardware needs.
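As background for the hint, the hardware side of the trade-off can be estimated with simple arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter) and a dense model whose weights are all read once per generated token, which gives a rough upper bound on single-stream decoding speed. The function names and the 1000 GB/s bandwidth figure are hypothetical values chosen for illustration, not measurements.

```python
def weight_bytes(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate bytes needed to hold the weights (fp16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param


def max_tokens_per_second(num_params: float, bandwidth_gb_s: float,
                          bytes_per_param: float = 2.0) -> float:
    """Rough upper bound on single-stream decode speed for a dense model.

    Assumes every weight is read once per generated token, so speed is
    limited by memory bandwidth. Real throughput is lower.
    """
    return bandwidth_gb_s * 1e9 / weight_bytes(num_params, bytes_per_param)


if __name__ == "__main__":
    BANDWIDTH_GB_S = 1000.0  # hypothetical memory bandwidth, for illustration only
    for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
        gb = weight_bytes(params) / 1e9
        tps = max_tokens_per_second(params, BANDWIDTH_GB_S)
        print(f"{name}: ~{gb:.0f} GB of fp16 weights, at most ~{tps:.0f} tokens/s")
```

Under these assumptions, memory grows linearly with parameter count and the speed bound shrinks in proportion, which is the pattern the hint points at.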
Predict Output
intermediate
Output of Token Generation with Temperature in LLM
Given the following pseudocode for generating tokens from a self-hosted LLM with temperature=0.0, what is the expected behavior of the output tokens?
tokens = model.generate(input_ids, temperature=0.0, max_length=5)
print(tokens)
A. The model outputs tokens only from a fixed vocabulary subset.
B. The model outputs the most likely tokens deterministically.
C. The model outputs random tokens with equal probability.
D. The model outputs tokens with high randomness and diversity.
💡 Hint
Temperature controls randomness in token selection.
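To make the hint concrete, here is a minimal sketch of temperature scaling written with the standard library only. It is not how any particular inference library implements sampling. The logits are made-up numbers for a three-token vocabulary, and treating a temperature of exactly 0 as greedy argmax is an assumption of this sketch (some libraries reject 0 and expose greedy decoding through a separate switch).

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature, rng=random):
    """Pick a token index; temperature 0 is treated as greedy argmax."""
    if temperature == 0.0:
        # Dividing by zero is undefined, so take the most likely token directly.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5]  # made-up scores for a three-token vocabulary
    print([pick_token(logits, 0.0) for _ in range(5)])  # same index every time
    print(softmax_with_temperature(logits, 1.0))
    print(softmax_with_temperature(logits, 5.0))  # flatter distribution
```

Lower temperatures concentrate probability on the top-scoring token, and higher temperatures spread it across the vocabulary.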
Model Choice
advanced
Choosing a Self-hosted LLM for Low-latency Applications
You want to deploy a self-hosted LLM for a chatbot that requires fast responses on limited hardware. Which model choice is best?
A. A smaller Mistral 7B model quantized to 4-bit precision.
B. A large Llama 70B model with full precision weights.
C. A large Mistral 30B model with no quantization.
D. A medium Llama 13B model with float32 precision.
💡 Hint
Consider model size and quantization effects on speed and memory.
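A quick way to compare the four options is to estimate the memory needed for the weights alone, as parameters times bits per weight. The sketch below is only an approximation: it ignores activations, the KV cache, and quantization overhead. Reading "full precision" and "no quantization" as 16-bit weights is an assumption made here, and 32-bit weights would double those two figures.

```python
GIB = 2**30


def weight_gib(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


# (parameters in billions, bits per weight) for each answer option.
OPTIONS = {
    "A: 7B, 4-bit": (7, 4),
    "B: 70B, unquantized": (70, 16),  # assumption: 16-bit weights
    "C: 30B, unquantized": (30, 16),  # assumption: 16-bit weights
    "D: 13B, float32": (13, 32),
}

if __name__ == "__main__":
    for label, (params, bits) in OPTIONS.items():
        print(f"{label}: ~{weight_gib(params, bits):.1f} GiB")
```

A smaller footprint also means less data to move per generated token, which is why size and quantization both matter for latency on limited hardware.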
Metrics
advanced
Evaluating Self-hosted LLM Output Quality
Which metric is most appropriate to evaluate the quality of text generated by a self-hosted LLM like Llama or Mistral on a language generation task?
A. Perplexity measuring how well the model predicts the next token.
B. Accuracy measuring exact token matches with ground truth.
C. Mean Squared Error between predicted and actual token embeddings.
D. F1 score measuring classification correctness.
💡 Hint
Think about metrics used in language modeling tasks.
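For reference, perplexity is the exponential of the average negative log-probability the model assigns to each actual next token. The sketch below computes it from a list of per-token log-probabilities. The probabilities are invented for illustration and do not come from a real model.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


if __name__ == "__main__":
    # Invented probabilities the model assigned to the correct next tokens.
    confident = [math.log(0.9), math.log(0.8), math.log(0.95)]
    unsure = [math.log(0.2), math.log(0.1), math.log(0.25)]
    print(perplexity(confident))  # lower is better
    print(perplexity(unsure))
```

A model that assigns probability 0.25 to every correct token has a perplexity of 4, as if it were choosing uniformly among four candidates at each step.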
🔧 Debug
expert
Debugging Memory Error in Self-hosted LLM Inference
You try to run inference on a Llama 13B model but get a CUDA out-of-memory error. Which action will most likely fix this issue?
A. Add more layers to the model to distribute memory load.
B. Increase learning rate to speed up training and reduce memory.
C. Reduce batch size or use model quantization to lower memory use.
D. Disable GPU and run inference on CPU only.
💡 Hint
Think about how to reduce GPU memory usage during inference.
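To see where GPU memory goes during inference, the rough sketch below estimates the two largest consumers: the weights and the key/value cache. The layer count (40) and hidden size (5120) are assumed values for a 13B Llama-style model with standard multi-head attention and 16-bit cache entries. Models that use grouped-query attention store less, so treat the numbers as illustrative.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers=40, hidden_size=5120,
                   bytes_per_value=2):
    """Approximate key/value cache size for standard multi-head attention."""
    # 2 = one key vector plus one value vector per layer per token
    return 2 * num_layers * hidden_size * bytes_per_value * seq_len * batch_size


def weight_bytes(num_params=13e9, bits_per_weight=16):
    """Approximate memory for the weights alone."""
    return num_params * bits_per_weight / 8


if __name__ == "__main__":
    GIB = 2**30
    for bits in (16, 8):
        print(f"weights at {bits}-bit: {weight_bytes(bits_per_weight=bits) / GIB:.1f} GiB")
    for batch in (1, 4, 16):
        cache = kv_cache_bytes(batch, seq_len=2048)
        print(f"batch {batch:>2}: KV cache {cache / GIB:.1f} GiB")
```

In this estimate the cache grows linearly with batch size and sequence length, while quantization shrinks the fixed cost of the weights.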

Practice

1. What is the main advantage of using self-hosted LLMs like Llama or Mistral?
easy
A. You keep full control and privacy over your data
B. They always run faster than cloud models
C. They require no installation or setup
D. They provide unlimited free internet access

Solution

  1. Step 1: Understand self-hosted LLMs purpose

    Self-hosted LLMs run on your own machines, so your data stays private and under your control.
  2. Step 2: Compare options

    Cloud models may send data externally; self-hosted models avoid this, ensuring privacy.
  3. Final Answer:

    You keep full control and privacy over your data -> Option A
  4. Quick Check:

    Privacy and control = A [OK]
Hint: Self-hosted means data stays with you, so privacy is key [OK]
Common Mistakes:
  • Thinking self-hosted models are always faster
  • Assuming no setup is needed
  • Confusing self-hosted with cloud services
2. Which Python code snippet correctly loads a Llama model using the Hugging Face Transformers library?
easy
A. from transformers import LlamaForCausalLM; model = LlamaForCausalLM.from_pretrained('llama-model')
B. import llama; model = llama.load('llama-model')
C. from transformers import MistralModel; model = MistralModel.load('llama-model')
D. model = load_model('llama-model')

Solution

  1. Step 1: Identify correct library and class

    The Hugging Face Transformers library uses LlamaForCausalLM to load Llama models.
  2. Step 2: Check method to load model

    from_pretrained is the standard method to load pretrained models in Transformers.
  3. Final Answer:

    from transformers import LlamaForCausalLM; model = LlamaForCausalLM.from_pretrained('llama-model') -> Option A
  4. Quick Check:

    Transformers + from_pretrained = A [OK]
Hint: Use Transformers library and from_pretrained to load models [OK]
Common Mistakes:
  • Using wrong import names
  • Calling non-existent load methods
  • Confusing Mistral and Llama classes
3. Given this code snippet using a Mistral model, what is the type of the variable output?
from transformers import AutoTokenizer, MistralForCausalLM
# 'mistral-base' stands in for a real model ID or local path
model = MistralForCausalLM.from_pretrained('mistral-base')
tokenizer = AutoTokenizer.from_pretrained('mistral-base')
inputs = tokenizer('Hello world', return_tensors='pt')
outputs = model.generate(**inputs)
output = tokenizer.decode(outputs[0])
medium
A. An error because generate is not defined
B. A tensor of token IDs
C. A list of token probabilities
D. A decoded string of generated text

Solution

  1. Step 1: Understand model.generate output

    model.generate returns a tensor of token IDs with one row per input sequence, so outputs[0] is the ID sequence for the single prompt.
  2. Step 2: Decode tokens to string

    tokenizer.decode converts token IDs to a readable string.
  3. Final Answer:

    A decoded string of generated text -> Option D
  4. Quick Check:

    generate + decode = string output [OK]
Hint: generate returns tokens; decode converts tokens to string [OK]
Common Mistakes:
  • Thinking output is raw tensor
  • Confusing probabilities with tokens
  • Assuming generate method is missing
4. You try to load a Llama model with this code but get an error:
from transformers import LlamaForCausalLM
model = LlamaForCausalLM.load('llama-model')
What is the likely cause of the error?
medium
A. LlamaForCausalLM cannot be imported from transformers
B. The model name 'llama-model' is invalid
C. The method load() does not exist; should use from_pretrained()
D. You need to install the Mistral library first

Solution

  1. Step 1: Check method names in Transformers

    Transformers models use from_pretrained() to load models, not load().
  2. Step 2: Identify error cause

    Calling load() raises an AttributeError because LlamaForCausalLM defines no such method.
  3. Final Answer:

    The method load() does not exist; should use from_pretrained() -> Option C
  4. Quick Check:

    Use from_pretrained, not load [OK]
Hint: Use from_pretrained() to load models, not load() [OK]
Common Mistakes:
  • Assuming load() is valid method
  • Blaming model name without checking method
  • Confusing Llama and Mistral imports
5. You want to run a self-hosted Llama model on your local machine but it has limited RAM. Which approach helps reduce memory usage while keeping reasonable performance?
hard
A. Use a cloud service instead of local hosting
B. Use quantization to reduce model size and load with 8-bit precision
C. Run the model on CPU without any batching
D. Load the full 32-bit model without any optimization

Solution

  1. Step 1: Understand memory constraints

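    Limited RAM makes a full 32-bit model impractical: at 4 bytes per parameter, a 7B-parameter model needs roughly 28 GB for the weights alone.
  2. Step 2: Apply quantization

    Quantization stores weights at lower precision (e.g., 8-bit), which takes about a quarter of the memory of 32-bit weights, usually with only a small loss in output quality.
  3. Step 3: Evaluate other options

    Loading the full 32-bit model does nothing to reduce memory; running on CPU without batching is slow and does not shrink the model; a cloud service is no longer self-hosted.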
  4. Final Answer:

    Use quantization to reduce model size and load with 8-bit precision -> Option B
  5. Quick Check:

    Quantization saves memory and keeps performance [OK]
Hint: Quantize models to 8-bit for less RAM use [OK]
Common Mistakes:
  • Loading full 32-bit model ignoring RAM limits
  • Running on CPU without batching, which is slow and saves no memory
  • Switching to cloud defeats self-hosting purpose
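
One possible way to apply the 8-bit approach with Hugging Face Transformers is sketched below. It assumes the transformers, accelerate, and bitsandbytes packages are installed and that a CUDA GPU is available, since bitsandbytes 8-bit loading generally targets GPUs. The function name load_llama_8bit is hypothetical, and model_id must be supplied by the caller.

```python
def load_llama_8bit(model_id: str):
    """Load a causal LM with 8-bit weights and return (model, tokenizer).

    model_id is supplied by the caller (a Hub ID or local path); nothing is
    downloaded until this function is called.
    """
    # Imported lazily so that defining the function needs no heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(load_in_8bit=True)  # store weights as 8-bit integers
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # let accelerate place layers on the available devices
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer
```

On a machine without a suitable GPU, other quantized formats and runtimes may be a better fit. This sketch only illustrates the option named in the answer.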