Practice

1. What is the main advantage of using self-hosted LLMs like Llama or Mistral?

easy

A. You keep full control and privacy over your data

B. They always run faster than cloud models

C. They require no installation or setup

D. They provide unlimited free internet access

Solution

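Step 1: Understand the purpose of self-hosted LLMs
Self-hosted LLMs run on your own machines, so your data stays private and under your control.
Step 2: Compare options
Cloud-hosted models require sending your prompts to a third-party server; self-hosted models avoid this, so the data never leaves your infrastructure. Self-hosting does not guarantee speed (B), still requires installation and setup (C), and has nothing to do with internet access (D).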
Final Answer:
You keep full control and privacy over your data -> Option A
Quick Check:
Privacy and control = A [OK]

Hint: Self-hosted means data stays with you, so privacy is key [OK]

Common Mistakes:

Thinking self-hosted models are always faster
Assuming no setup is needed
Confusing self-hosted with cloud services

2. Which Python code snippet correctly loads a Llama model using the Hugging Face Transformers library?

easy

A. from transformers import LlamaForCausalLM; model = LlamaForCausalLM.from_pretrained('llama-model')

B. import llama; model = llama.load('llama-model')

C. from transformers import MistralModel; model = MistralModel.load('llama-model')

D. model = load_model('llama-model')

Solution

Step 1: Identify correct library and class
The Hugging Face Transformers library uses LlamaForCausalLM to load Llama models.
Step 2: Check method to load model
from_pretrained is the standard method to load pretrained models in Transformers.
Final Answer:
from transformers import LlamaForCausalLM; model = LlamaForCausalLM.from_pretrained('llama-model') -> Option A
Quick Check:
Transformers + from_pretrained = A [OK]

Hint: Use Transformers library and from_pretrained to load models [OK]

Common Mistakes:

Using wrong import names
Calling non-existent load methods
Confusing Mistral and Llama classes
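
As a rough illustration of the from_pretrained pattern in Option A, the sketch below avoids downloading anything: it builds a tiny, randomly initialised Llama model from a LlamaConfig, saves it to a temporary directory, and reloads it from there. The layer sizes are arbitrary assumptions chosen to keep it fast, and the dependency check only lets the script exit quietly where transformers or torch is not installed.

```python
import importlib.util
import tempfile


def save_and_reload_tiny_llama():
    """Save a tiny random Llama model to disk, then reload it with from_pretrained."""
    from transformers import LlamaConfig, LlamaForCausalLM

    # Arbitrary, very small sizes (assumptions for illustration only).
    config = LlamaConfig(
        vocab_size=128,
        hidden_size=32,
        intermediate_size=64,
        num_hidden_layers=2,
        num_attention_heads=4,
        max_position_embeddings=64,
    )
    model = LlamaForCausalLM(config)  # random weights, not a trained model

    with tempfile.TemporaryDirectory() as model_dir:
        model.save_pretrained(model_dir)
        # from_pretrained accepts a local directory as well as a Hub model ID.
        reloaded = LlamaForCausalLM.from_pretrained(model_dir)

    return type(reloaded).__name__, reloaded.config.num_hidden_layers


# Skip the demo if the optional dependencies are missing.
deps_available = all(
    importlib.util.find_spec(name) is not None for name in ("transformers", "torch")
)
reload_result = save_and_reload_tiny_llama() if deps_available else None
print(reload_result)
```

With a real checkpoint, the same from_pretrained call would take the model's Hub ID or a local path in place of the temporary directory.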

3. Given this code snippet using a Mistral model, what is the type of the variable output?

from transformers import AutoTokenizer, MistralForCausalLM
model = MistralForCausalLM.from_pretrained('mistral-base')
tokenizer = AutoTokenizer.from_pretrained('mistral-base')
inputs = tokenizer('Hello world', return_tensors='pt')
outputs = model.generate(**inputs)
output = tokenizer.decode(outputs[0])

medium

A. An error because generate is not defined

B. A tensor of token IDs

C. A list of token probabilities

D. A decoded string of generated text

Solution

Step 1: Understand model.generate output
model.generate returns token IDs as tensors representing generated text tokens.
Step 2: Decode tokens to string
tokenizer.decode converts token IDs to a readable string.
Final Answer:
A decoded string of generated text -> Option D
Quick Check:
generate + decode = string output [OK]

Hint: generate returns tokens; decode converts tokens to string [OK]

Common Mistakes:

Thinking output is raw tensor
Confusing probabilities with tokens
Assuming generate method is missing
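
The generate-then-decode flow can be checked offline with a small sketch. It rests on several assumptions: a toy word-level vocabulary stands in for a real tokenizer, and a tiny randomly initialised Llama model stands in for the Mistral checkpoint (both share the same generate interface), so the generated text is meaningless. Only the types matter here.

```python
import importlib.util


def generate_and_decode():
    """Report whether generate() returns a tensor and what type decode() returns."""
    import torch
    from tokenizers import Tokenizer
    from tokenizers.models import WordLevel
    from tokenizers.pre_tokenizers import Whitespace
    from transformers import LlamaConfig, LlamaForCausalLM, PreTrainedTokenizerFast

    # Toy vocabulary (hypothetical; real models ship their own tokenizer files).
    vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "Hello": 3, "world": 4, "again": 5}
    raw_tokenizer = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
    raw_tokenizer.pre_tokenizer = Whitespace()
    tokenizer = PreTrainedTokenizerFast(
        tokenizer_object=raw_tokenizer, unk_token="<unk>"
    )

    # Tiny random model whose vocabulary size matches the toy tokenizer.
    config = LlamaConfig(
        vocab_size=len(vocab),
        hidden_size=32,
        intermediate_size=64,
        num_hidden_layers=1,
        num_attention_heads=4,
        max_position_embeddings=64,
    )
    model = LlamaForCausalLM(config)

    # token_type_ids are not accepted by this model, so do not request them.
    inputs = tokenizer("Hello world", return_tensors="pt", return_token_type_ids=False)
    outputs = model.generate(**inputs, max_new_tokens=3, do_sample=False)
    output = tokenizer.decode(outputs[0])  # token IDs back to text

    return isinstance(outputs, torch.Tensor), type(output).__name__


# Skip the demo if the optional dependencies are missing.
deps_available = all(
    importlib.util.find_spec(name) is not None
    for name in ("transformers", "torch", "tokenizers")
)
types_seen = generate_and_decode() if deps_available else None
print(types_seen)
```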

4. You try to load a Llama model with this code but get an error:

from transformers import LlamaForCausalLM
model = LlamaForCausalLM.load('llama-model')

What is the likely cause of the error?

medium

A. LlamaForCausalLM cannot be imported from transformers

B. The model name 'llama-model' is invalid

C. The method load() does not exist; should use from_pretrained()

D. You need to install the Mistral library first

Solution

Step 1: Check method names in Transformers
Transformers models use from_pretrained() to load models, not load().
Step 2: Identify error cause
Using load() causes AttributeError because it is not defined for LlamaForCausalLM.
Final Answer:
The method load() does not exist; should use from_pretrained() -> Option C
Quick Check:
Use from_pretrained, not load [OK]

Hint: Use from_pretrained() to load models, not load() [OK]

Common Mistakes:

Assuming load() is valid method
Blaming model name without checking method
Confusing Llama and Mistral imports
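
A quick way to confirm the diagnosis is to inspect the class itself. The sketch below checks which loader methods LlamaForCausalLM exposes and captures the exception raised by the faulty call. The name 'llama-model' is the placeholder from the question, and the dependency check is an assumption added so the script exits quietly without transformers or torch.

```python
import importlib.util


def check_loader_methods():
    """Return whether from_pretrained exists and which error load() raises."""
    from transformers import LlamaForCausalLM

    has_from_pretrained = hasattr(LlamaForCausalLM, "from_pretrained")
    try:
        # Attribute lookup fails before any model file is touched.
        LlamaForCausalLM.load("llama-model")
        error_name = None
    except AttributeError as exc:
        error_name = type(exc).__name__
    return has_from_pretrained, error_name


# Skip the demo if the optional dependencies are missing.
deps_available = all(
    importlib.util.find_spec(name) is not None for name in ("transformers", "torch")
)
loader_check = check_loader_methods() if deps_available else None
print(loader_check)
```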

5. You want to run a self-hosted Llama model on your local machine but it has limited RAM. Which approach helps reduce memory usage while keeping reasonable performance?

hard

A. Use a cloud service instead of local hosting

B. Use quantization to reduce model size and load with 8-bit precision

C. Run the model on CPU without any batching

D. Load the full 32-bit model without any optimization

Solution

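Step 1: Understand memory constraints
A full 32-bit model stores every parameter in 4 bytes, so a 7-billion-parameter model, for example, needs roughly 28 GB for the weights alone, which may not fit in limited RAM.
Step 2: Apply quantization
Quantization stores the weights at lower precision (e.g., 8-bit), cutting weight memory to about a quarter of the 32-bit size while keeping decent speed.
Step 3: Evaluate other options
Loading the full 32-bit model uses the most memory; running on CPU without batching is slow and does not shrink the model; a cloud service is no longer self-hosted.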
Final Answer:
Use quantization to reduce model size and load with 8-bit precision -> Option B
Quick Check:
Quantization saves memory and keeps performance [OK]

Hint: Quantize models to 8-bit for less RAM use [OK]
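
The memory saving is simple arithmetic: parameters times bytes per parameter. The sketch below assumes a 7-billion-parameter model (an illustrative size, not one named in the question) and counts only the weights, ignoring activations and other runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumed 7-billion-parameter model

estimates = {bits: weight_memory_gb(NUM_PARAMS, bits) for bits in (32, 16, 8, 4)}
for bits, gigabytes in estimates.items():
    print(f"{bits:>2}-bit weights: ~{gigabytes:.1f} GB")
```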

Common Mistakes:

Loading the full 32-bit model despite RAM limits
Running on CPU without batching, which is slow and does not reduce model size
Switching to the cloud, which defeats the purpose of self-hosting
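
To make the idea of 8-bit quantization concrete, here is a minimal pure-Python sketch of absmax quantization, one simple scheme among several. The weight values are made up for illustration. Real 8-bit loaders are more elaborate; this only shows the core idea of storing small integers plus a scale.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.53, 0.98, -1.27, 0.0]  # made-up example weights
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error per weight is bounded by half of the scale.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight now fits in one byte instead of four, at the cost of a small rounding error.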
