Latency optimization in Prompt Engineering / GenAI - Interactive Code Practice

Practice - 5 Tasks
Answer the questions below
1. Fill in the blank
easy

Complete the code to measure the latency of a function call using the time module.

import time
start = time.[1]()
result = my_function()
end = time.time()
latency = end - start
print(f"Latency: {latency} seconds")
A. sleep
B. perf_counter
C. time
D. process_time
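Common Mistakes
Using time.sleep(), which pauses execution instead of reading a clock.
Reading the start and end times from different clocks. time.perf_counter() is more precise than time.time() for benchmarks, but both readings must come from the same function, and here the end reading uses time.time().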
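The timing pattern above can be sketched as a small reusable helper. This is a minimal sketch, not the exercise's reference solution: `my_function` is a hypothetical stand-in (the original does not define it), and `measure_latency` is a name introduced here for illustration. It reads `time.perf_counter()` for both endpoints, so the two readings come from the same clock.

```python
import time


def my_function():
    # Hypothetical stand-in for the call being timed.
    return sum(range(100_000))


def measure_latency(fn):
    """Return (result, elapsed_seconds) for a single call to fn."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = fn()
    end = time.perf_counter()    # same clock as the start reading
    return result, end - start


result, latency = measure_latency(my_function)
print(f"Latency: {latency:.6f} seconds")
```

A single reading can be noisy, so in practice the call is usually repeated several times and the results summarized.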
2. Fill in the blank
medium

Complete the code to batch inputs so that model inference runs once per batch instead of once per input, reducing total processing time.

batch_size = [1]
inputs = get_inputs()
batched_inputs = [inputs[i:i+batch_size] for i in range(0, len(inputs), batch_size)]
A. 1
B. 1000
C. 32
D. 0
Common Mistakes
Using batch size 1, which sends one input per call and gains nothing from batching.
Using batch size 0, which makes range() raise ValueError because the step cannot be zero.
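The slicing pattern above can be wrapped in a helper that rejects invalid batch sizes up front. This is a minimal sketch: `make_batches` is a name introduced here, and the list of integers stands in for whatever `get_inputs()` returns in the exercise.

```python
def make_batches(inputs, batch_size):
    """Split inputs into consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [inputs[i:i + batch_size] for i in range(0, len(inputs), batch_size)]


inputs = list(range(10))  # hypothetical stand-in for get_inputs()
batches = make_batches(inputs, 4)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The last batch may be shorter than `batch_size`, so downstream code should not assume every batch is full.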
3. Fill in the blank
hard

Complete the code so that multiple inference calls run concurrently, reducing total latency.

import asyncio

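async def infer_async(item):
    # Run the blocking predict call in a worker thread so calls can overlap
    return await asyncio.to_thread(model.predict, item)

async def main():
    tasks = [infer_async(i) for i in inputs]
    # Unpack the list: the awaitables are passed as separate arguments
    results = await asyncio.[1](*tasks)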
    print(results)

asyncio.run(main())
A. gather
B. wait
C. run
D. sleep
Common Mistakes
Using asyncio.wait, which returns a pair of (done, pending) task sets rather than the results in order.
Calling asyncio.run inside a coroutine, which raises RuntimeError because an event loop is already running.
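The concurrency pattern above can be tried without a real model. In this minimal sketch, `infer_async` is a hypothetical stand-in that simulates an awaitable inference call with `asyncio.sleep`; the doubling logic and the 0.1 second delay are assumptions for illustration only.

```python
import asyncio
import time


async def infer_async(item):
    # Hypothetical stand-in for an awaitable inference call (e.g. a network request).
    await asyncio.sleep(0.1)
    return item * 2


async def main(inputs):
    # gather takes awaitables as separate positional arguments, hence the *.
    tasks = [infer_async(i) for i in inputs]
    return await asyncio.gather(*tasks)


start = time.perf_counter()
results = asyncio.run(main([1, 2, 3]))
elapsed = time.perf_counter() - start
print(results)  # [2, 4, 6]
print(f"Elapsed: {elapsed:.2f} seconds")
```

The three calls overlap only because the stand-in yields control while it waits. A blocking call would need to be moved off the event loop, for example with `asyncio.to_thread`, before `gather` could reduce total latency.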
4. Fill in the blank
hard

Fill both blanks to create a dictionary comprehension that filters features with latency less than threshold.

latency_dict = {feature: latency for feature, latency in features_latency.items() if latency [1] [2]}
A. <
B. >
C. 0.05
D. 0.5
Common Mistakes
Using '>' instead of '<', which keeps the slow features instead of the fast ones.
Picking a threshold that does not match the latency budget, so too few or too many features pass the filter.
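A filtering comprehension of this shape can be run end to end as follows. This is a sketch under stated assumptions: the feature names, the latency values, and the 0.3 second cutoff are all made up for illustration and are not taken from the exercise.

```python
# Hypothetical per-feature latencies in seconds (illustrative values only).
features_latency = {"tokenize": 0.01, "embed": 0.2, "rerank": 0.75}
threshold = 0.3  # assumed cutoff; choose one that matches your latency budget

fast_features = {
    feature: latency
    for feature, latency in features_latency.items()
    if latency < threshold
}
print(fast_features)  # {'tokenize': 0.01, 'embed': 0.2}
```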
5. Fill in the blank
hard

Fill all three blanks to create a dictionary comprehension that maps model names to their average latency if latency is above threshold.

avg_latency = {model[1]: sum(times)/len(times) for model, times in latency_data.items() if sum(times)/len(times) [2] [3]}
A. .upper()
B. >
C. 0.1
D. .lower()
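Common Mistakes
Using .upper() instead of .lower(), which produces keys that do not match lowercase model names used elsewhere.
Using '<' instead of '>', which keeps the models below the threshold instead of those above it.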
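The exercise computes each average twice, once in the condition and once in the value. One possible variant, sketched below, computes it once with an assignment expression (Python 3.8+). The model names, latency samples, and the 0.12 second cutoff are assumptions made up for illustration.

```python
from statistics import mean

# Hypothetical latency samples in seconds, keyed by model name.
latency_data = {
    "Small": [0.04, 0.06],
    "Medium": [0.1, 0.2],
    "Large": [0.25, 0.35],
}
threshold = 0.12  # assumed cutoff

slow_models = {
    name.lower(): avg                    # normalize keys to lowercase
    for name, times in latency_data.items()
    if (avg := mean(times)) > threshold  # condition runs first and binds avg
}
print(sorted(slow_models))  # ['large', 'medium']
```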

Practice

1. What is the main goal of latency optimization in AI models?
easy
A. To make AI models respond faster for better user experience
B. To increase the size of the AI model
C. To reduce the accuracy of the AI model
D. To add more layers to the AI model

Solution

  1. Step 1: Understand latency meaning

    Latency means the time it takes for a model to give an answer after input.
  2. Step 2: Connect latency to user experience

    Lower latency means faster responses, which improves how users feel about the AI.
  3. Final Answer:

    To make AI models respond faster for better user experience -> Option A
  4. Quick Check:

    Latency optimization = faster response [OK]
Hint: Latency means speed of response, optimize to make it faster [OK]
Common Mistakes:
  • Confusing latency with model size
  • Thinking latency means accuracy
  • Assuming more layers reduce latency
2. Which of the following correctly measures latency using Python's time module?
easy
A. start = time.sleep(); model.predict(x); end = time.sleep(); latency = end / start
B. start = time.time(); model.predict(x); end = time.time(); latency = end - start
C. start = time.clock(); model.predict(x); end = time.clock(); latency = start - end
D. start = time.now(); model.predict(x); end = time.now(); latency = end + start

Solution

  1. Step 1: Identify correct time functions

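    Python's time.time() returns the current time as a floating-point number of seconds since the epoch, so subtracting two readings gives the elapsed time.
  2. Step 2: Check latency calculation

    Latency = end - start measures the duration correctly. The other options call functions that pause execution (time.sleep), do not exist (time.now), or were removed in Python 3.8 (time.clock), and they combine the readings with the wrong operation.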
  3. Final Answer:

    start = time.time(); model.predict(x); end = time.time(); latency = end - start -> Option B
  4. Quick Check:

    Latency = end - start time [OK]
Hint: Use time.time() and subtract end-start for latency [OK]
Common Mistakes:
  • Using time.now() which does not exist
  • Subtracting start - end instead of end - start
  • Using time.sleep() which pauses code, not measures time
3. Given this code snippet measuring latency, what will be printed?
import time
start = time.time()
for _ in range(1000000):
    pass
end = time.time()
print(round(end - start, 2))
medium
A. A number close to 1.00
B. An error because of wrong syntax
C. A number close to 10.00
D. A number close to 0.00

Solution

  1. Step 1: Understand the loop workload

    The loop runs 1,000,000 times doing nothing (pass). Each iteration costs only tens of nanoseconds of interpreter overhead, so the whole loop finishes quickly.
  2. Step 2: Estimate time taken

    On typical hardware this empty loop takes roughly 0.03-0.1 seconds (the exact figure depends on the machine and Python version), so round(end - start, 2) prints a small value that is closest to 0.00 among the options.
  3. Final Answer:

    A number close to 0.00 -> Option D
  4. Quick Check:

    1M empty loops ~0.05s [OK]
Hint: 1 million empty loops take ~0.05 seconds [OK]
Common Mistakes:
  • Overestimating time for empty loop (e.g., thinking 1 second)
  • Thinking it takes 10 seconds
  • Assuming syntax error due to indentation
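Because the result depends on the machine, it helps to run the measurement rather than rely on an estimate. The sketch below repeats the empty-loop timing a few times and reports the minimum and median; `time_empty_loop` is a name introduced here, and the choice of five repetitions is arbitrary.

```python
import statistics
import time


def time_empty_loop(iterations):
    """Return the seconds taken by a loop that does nothing."""
    start = time.perf_counter()
    for _ in range(iterations):
        pass
    return time.perf_counter() - start


# Repeat the measurement: a single run can be skewed by other processes.
samples = [time_empty_loop(1_000_000) for _ in range(5)]
print(f"min={min(samples):.4f}s  median={statistics.median(samples):.4f}s")
```

The minimum is often used as the estimate least disturbed by background activity, while the median gives a sense of the typical run.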
4. You tried pruning your AI model to reduce latency but latency increased. What is the likely cause?
medium
A. Pruning removed important layers causing slower computation
B. Pruning always increases latency by design
C. Pruning was done incorrectly causing overhead in model execution
D. Latency measurement was done before pruning

Solution

  1. Step 1: Understand pruning effect

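    Pruning removes less important weights or structures to shrink the model. Latency drops only if the runtime can actually skip the removed work.
  2. Step 2: Identify why latency increased

    If latency increased, the pruning likely added overhead. A common case is unstructured pruning that leaves the weight matrices the same size but full of zeros, or stores them in a sparse format that the hardware or runtime does not accelerate, so execution gets slower.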
  3. Final Answer:

    Pruning was done incorrectly causing overhead in model execution -> Option C
  4. Quick Check:

    Incorrect pruning = more overhead = higher latency [OK]
Hint: Incorrect pruning adds overhead, increasing latency [OK]
Common Mistakes:
  • Assuming pruning always slows model
  • Ignoring measurement timing
  • Thinking pruning removes important layers by default
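One way pruning can fail to reduce latency is illustrated below in plain Python. This is a simplified sketch, not a real pruning library: the function names, the tiny weight matrix, and the 0.1 threshold are all assumptions for illustration. It contrasts unstructured pruning, which zeroes individual weights but keeps the matrix shape, with structured pruning, which drops whole rows and so shrinks the dense computation.

```python
def prune_unstructured(weights, threshold):
    """Zero out small weights; the matrix keeps its original shape."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def prune_structured(weights, threshold):
    """Drop whole rows whose mean absolute weight is below threshold."""
    return [row for row in weights if sum(abs(w) for w in row) / len(row) >= threshold]


def dense_multiply_adds(weights):
    """Multiply-adds a dense matrix-vector product performs (zeros included)."""
    return sum(len(row) for row in weights)


# Hypothetical 3x3 weight matrix (illustrative values only).
weights = [[0.9, 0.01, -0.8], [0.02, -0.01, 0.03], [0.5, 0.6, -0.02]]

unstructured = prune_unstructured(weights, 0.1)
structured = prune_structured(weights, 0.1)

print(dense_multiply_adds(weights))       # 9
print(dense_multiply_adds(unstructured))  # 9
print(dense_multiply_adds(structured))    # 6
```

With a dense kernel, the unstructured result does the same amount of work as the original, so any bookkeeping added for sparsity only increases latency.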
5. You want to reduce latency of a large AI model for mobile devices. Which combined approach is best?
hard
A. Use quantization to reduce precision and prune unimportant weights
B. Increase model layers and use caching on server
C. Only use caching without changing model size
D. Train a bigger model with more data

Solution

  1. Step 1: Identify techniques for latency reduction on mobile

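    Quantization stores weights at lower numeric precision (for example, 8-bit integers instead of 32-bit floats), which cuts memory traffic and speeds up computation on supporting hardware; pruning removes unneeded weights to shrink the model.
  2. Step 2: Evaluate options

    Adding layers or training a bigger model increases latency; caching helps with repeated requests but alone does not address the compute and memory limits of a mobile device.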
  3. Final Answer:

    Use quantization to reduce precision and prune unimportant weights -> Option A
  4. Quick Check:

    Quantization + pruning = best latency reduction [OK]
Hint: Combine quantization and pruning for mobile latency [OK]
Common Mistakes:
  • Thinking bigger models reduce latency
  • Relying only on caching for mobile speed
  • Ignoring model size impact on mobile devices
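The idea behind quantization can be sketched in a few lines of plain Python. This is a simplified illustration of symmetric 8-bit quantization, not how any particular framework implements it; the function names and the sample weights are assumptions introduced here.

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float values from the quantized integers."""
    return [q * scale for q in quantized]


weights = [0.82, -0.41, 0.05, -1.27, 0.0]  # hypothetical weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(f"scale={scale:.4f}  max_error={max_error:.4f}")
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most about half the scale per weight.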