
Self-hosted LLMs (Llama, Mistral) in Prompt Engineering / GenAI - Cheat Sheet & Quick Revision

Recall & Review
beginner
What does 'self-hosted LLM' mean?
A self-hosted LLM is a large language model that you run on your own computer or server instead of using a cloud service. It gives you full control over the model and data.
beginner
Name two popular self-hosted LLMs.
Two popular self-hosted LLMs are Llama and Mistral. They are open-weight models you can download and run locally or on your own servers.
intermediate
Why might someone choose to use a self-hosted LLM like Llama or Mistral?
People use self-hosted LLMs to keep data private, avoid per-request cloud fees, customize or fine-tune models, and avoid network round trips to a remote API.
intermediate
What is a key challenge when running self-hosted LLMs?
A key challenge is needing powerful hardware like GPUs and enough memory to run large models efficiently.
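As a rough illustration of why memory is the bottleneck, weight memory can be approximated as parameter count times bytes per parameter. The sketch below assumes a 7-billion-parameter model and counts weights only; activations, the KV cache, and framework overhead are ignored, so real usage will be higher. The helper name weight_memory_gb is made up for this example.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Weights only; activations and KV cache are NOT included (assumption).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Return approximate weight memory in GB (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Hypothetical 7-billion-parameter model at each precision.
for precision in BYTES_PER_PARAM:
    print(f"7B params, {precision}: ~{weight_memory_gb(7e9, precision):.1f} GB")
```

Halving the bytes per parameter halves the weight footprint, which is why lower-precision formats matter on consumer hardware.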
advanced
How do Llama and Mistral differ in their design or use?
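Llama is Meta's family of open-weight models, released in a range of sizes and widely used for research and fine-tuning. Mistral models (for example, Mistral 7B) aim for strong performance at a small parameter count, using techniques such as grouped-query attention and sliding-window attention to make inference faster and lighter.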
What is a main advantage of self-hosting an LLM?
A. Automatic cloud updates
B. No need for any hardware
C. Full control over data and model
D. Unlimited free usage
Which hardware is usually needed to run self-hosted LLMs efficiently?
A. Smartphone
B. Powerful GPU and enough RAM
C. Tablet
D. Basic laptop CPU
Llama and Mistral are examples of what kind of models?
A. Self-hosted LLMs
B. Image recognition models
C. Cloud-only LLMs
D. Speech-to-text models
Which is NOT a reason to use a self-hosted LLM?
A. Data privacy
B. Avoiding cloud fees
C. No need for internet
D. No hardware requirements
Mistral models are designed to be:
A. Lightweight and fast
B. Very large and slow
C. Only for image tasks
D. Closed source
Explain what a self-hosted LLM is and why someone might want to use one.
Think about running the model on your own computer instead of online.
Compare Llama and Mistral models in terms of their design goals and typical use cases.
Focus on what makes each model special.

      Practice

      1. What is the main advantage of using self-hosted LLMs like Llama or Mistral?
      easy
      A. You keep full control and privacy over your data
      B. They always run faster than cloud models
      C. They require no installation or setup
      D. They provide unlimited free internet access

      Solution

      1. Step 1: Understand the purpose of self-hosted LLMs

        Self-hosted LLMs run on your own machines, so your data stays private and under your control.
      2. Step 2: Compare options

        Cloud services require sending your data to an external provider; self-hosted models keep it on your own machines.
      3. Final Answer:

        You keep full control and privacy over your data -> Option A
      4. Quick Check:

        Privacy and control = A [OK]
      Hint: Self-hosted means data stays with you, so privacy is key [OK]
      Common Mistakes:
      • Thinking self-hosted models are always faster
      • Assuming no setup is needed
      • Confusing self-hosted with cloud services
      2. Which Python code snippet correctly loads a Llama model using the Hugging Face Transformers library?
      easy
      A. from transformers import LlamaForCausalLM; model = LlamaForCausalLM.from_pretrained('llama-model')
      B. import llama; model = llama.load('llama-model')
      C. from transformers import MistralModel; model = MistralModel.load('llama-model')
      D. model = load_model('llama-model')

      Solution

      1. Step 1: Identify correct library and class

        The Hugging Face Transformers library uses LlamaForCausalLM to load Llama models.
      2. Step 2: Check method to load model

        from_pretrained is the standard method to load pretrained models in Transformers.
      3. Final Answer:

        from transformers import LlamaForCausalLM; model = LlamaForCausalLM.from_pretrained('llama-model') -> Option A
      4. Quick Check:

        Transformers + from_pretrained = A [OK]
      Hint: Use Transformers library and from_pretrained to load models [OK]
      Common Mistakes:
      • Using wrong import names
      • Calling non-existent load methods
      • Confusing Mistral and Llama classes
      3. Given this code snippet using a Mistral model, what is the type of output?
      from transformers import MistralForCausalLM, AutoTokenizer
      model = MistralForCausalLM.from_pretrained('mistral-base')
      tokenizer = AutoTokenizer.from_pretrained('mistral-base')
      inputs = tokenizer('Hello world', return_tensors='pt')
      outputs = model.generate(**inputs)
      output = tokenizer.decode(outputs[0])
      medium
      A. An error because generate is not defined
      B. A tensor of token IDs
      C. A list of token probabilities
      D. A decoded string of generated text

      Solution

      1. Step 1: Understand model.generate output

        model.generate returns a tensor of token IDs: the prompt tokens followed by the newly generated tokens.
      2. Step 2: Decode tokens to string

        tokenizer.decode converts token IDs to a readable string.
      3. Final Answer:

        A decoded string of generated text -> Option D
      4. Quick Check:

        generate + decode = string output [OK]
      Hint: generate returns tokens; decode converts tokens to string [OK]
      Common Mistakes:
      • Thinking output is a raw tensor
      • Confusing probabilities with tokens
      • Assuming the generate method is missing
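The generate-then-decode flow above can be mimicked with a toy stand-in that needs no model download. This is only a sketch: TOY_VOCAB, toy_generate, and toy_decode are hypothetical names invented here and are not part of the Transformers API. The point is the types involved: generation yields integer token IDs, and decoding turns them into a string.

```python
# Toy stand-in for the generate -> decode flow (NOT the Transformers API).
# The vocabulary and both functions are hypothetical, for illustration only.
TOY_VOCAB = {0: "<s>", 1: "Hello", 2: "world", 3: "!"}


def toy_generate(prompt_ids):
    """Pretend model: returns the prompt IDs plus one appended token ID."""
    return prompt_ids + [3]


def toy_decode(token_ids, skip_special_tokens=True):
    """Map token IDs back to text, as a tokenizer's decode step would."""
    tokens = [TOY_VOCAB[i] for i in token_ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t != "<s>"]
    return " ".join(tokens)


output_ids = toy_generate([0, 1, 2])   # a list of integer token IDs
output_text = toy_decode(output_ids)   # a plain Python string
print(type(output_ids).__name__, type(output_text).__name__)
```

In the real library the IDs come back as a tensor rather than a list, but the division of labor is the same: the model works in token IDs, and only the tokenizer produces text.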
      4. You try to load a Llama model with this code but get an error:
      from transformers import LlamaForCausalLM
      model = LlamaForCausalLM.load('llama-model')
      What is the likely cause of the error?
      medium
      A. LlamaForCausalLM cannot be imported from transformers
      B. The model name 'llama-model' is invalid
      C. The method load() does not exist; should use from_pretrained()
      D. You need to install the Mistral library first

      Solution

      1. Step 1: Check method names in Transformers

        Transformers models use from_pretrained() to load models, not load().
      2. Step 2: Identify error cause

        Calling load() raises an AttributeError because LlamaForCausalLM defines no such method.
      3. Final Answer:

        The method load() does not exist; should use from_pretrained() -> Option C
      4. Quick Check:

        Use from_pretrained, not load [OK]
      Hint: Use from_pretrained() to load models, not load() [OK]
      Common Mistakes:
      • Assuming load() is a valid method
      • Blaming model name without checking method
      • Confusing Llama and Mistral imports
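The failure mode in this question can be reproduced without installing anything. The sketch below uses a hypothetical StandInModel class that only imitates the classmethod-loader pattern; it is not the real LlamaForCausalLM, and it loads no weights.

```python
class StandInModel:
    """Hypothetical stand-in imitating a classmethod-based loader."""

    def __init__(self, name):
        self.name = name

    @classmethod
    def from_pretrained(cls, name):
        # A real loader would read config and weights; this only stores the name.
        return cls(name)


# The supported entry point works.
model = StandInModel.from_pretrained("llama-model")

# Calling a method the class never defined raises AttributeError.
try:
    StandInModel.load("llama-model")
except AttributeError as err:
    error_message = str(err)
    print("AttributeError:", error_message)
```

When an AttributeError names a method on a library class, checking the method name against the documentation is usually a quicker first step than suspecting the model name or the installation.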
      5. You want to run a self-hosted Llama model on your local machine but it has limited RAM. Which approach helps reduce memory usage while keeping reasonable performance?
      hard
      A. Use a cloud service instead of local hosting
      B. Use quantization to reduce model size and load with 8-bit precision
      C. Run the model on CPU without any batching
      D. Load the full 32-bit model without any optimization

      Solution

      1. Step 1: Understand memory constraints

        With limited RAM, a full 32-bit model (4 bytes per parameter) may not fit in memory, or may run very slowly if it does.
      2. Step 2: Apply quantization

        Quantization stores weights at lower precision (e.g., 8-bit instead of 32-bit), cutting weight memory to roughly a quarter while keeping reasonable speed, usually at a small cost in accuracy.
      3. Step 3: Evaluate other options

        Loading the full 32-bit model uses the most memory; running on CPU without batching is slow and does not shrink the model; a cloud service is not self-hosting.
      4. Final Answer:

        Use quantization to reduce model size and load with 8-bit precision -> Option B
      5. Quick Check:

        Quantization saves memory and keeps performance [OK]
      Hint: Quantize models to 8-bit for less RAM use [OK]
      Common Mistakes:
      • Loading the full 32-bit model while ignoring RAM limits
      • Running without batching, which makes inference slow
      • Switching to the cloud, which defeats the purpose of self-hosting
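To make the idea of 8-bit quantization concrete, here is a minimal pure-Python sketch of symmetric absmax quantization on a short list of made-up weights. It is an illustration of the principle only, assuming one scale for the whole list; libraries that quantize real models (for example, 8-bit loading in Transformers via bitsandbytes) use more refined schemes.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    # One shared scale for the whole list (assumption; real schemes often
    # use per-block or per-channel scales). Fall back to 1.0 if all zeros.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]          # made-up example values
quantized, scale = quantize_int8(weights)  # each value now fits in one byte
restored = dequantize(quantized, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, f"max error {max_error:.4f}")
```

Each weight now needs one byte plus a shared scale instead of four bytes, which is where the memory saving comes from; the price is the small rounding error measured at the end.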