Which factor most directly affects the latency of a machine learning model during inference?
Think about what happens when the model makes a prediction.
Inference latency depends mainly on model complexity, which is determined by the number of layers and parameters: more parameters mean more computation per prediction, so larger models take longer to produce outputs.
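A minimal sketch of this effect, using stacked matrix multiplications as a stand-in for a network's dense layers (layer counts and widths are illustrative, not from the original):

```python
import time
import numpy as np

def dense_forward(x, weights):
    # Apply each layer's weight matrix in sequence (a simplified
    # stand-in for a neural network's dense layers; activations omitted).
    for w in weights:
        x = x @ w
    return x

def time_model(n_layers, width, trials=5):
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((width, width)) for _ in range(n_layers)]
    x = rng.standard_normal((1, width))
    start = time.perf_counter()
    for _ in range(trials):
        dense_forward(x, weights)
    return (time.perf_counter() - start) / trials

small = time_model(n_layers=2, width=128)
large = time_model(n_layers=16, width=1024)
print(f"small model: {small:.6f}s, large model: {large:.6f}s")
```

The larger model performs far more multiply-accumulate operations per prediction, so its measured latency is higher.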
What is the output of the following code that simulates latency for different batch sizes?
import time

def simulate_latency(batch_size):
    base_time = 0.01  # seconds per sample
    total_time = base_time * batch_size
    time.sleep(total_time)
    return total_time

latencies = {b: simulate_latency(b) for b in [1, 5, 10]}
print(latencies)
The output is {1: 0.01, 5: 0.05, 10: 0.1}: latency scales linearly with batch size in this simulation.
The function multiplies the 0.01 s base time per sample by the batch size, so latency increases linearly, and the dictionary comprehension records the returned time for each batch size.
You need to deploy a model on a device with limited processing power and require very low latency. Which model architecture is best suited?
Think about model size and computation needed for fast predictions.
A decision tree: evaluating one requires only a handful of comparisons per prediction, with no heavy matrix arithmetic, making it well suited to low-latency inference on limited hardware.
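To see why tree evaluation is cheap, here is a hand-coded sketch of a trained tree (the feature names and thresholds are hypothetical): a prediction is just a few nested comparisons.

```python
def tiny_tree_predict(features):
    # A hand-coded decision tree: one prediction costs at most a few
    # comparisons, regardless of hardware, with no matrix math at all.
    if features["cpu_load"] < 0.5:
        if features["temp"] < 70:
            return "ok"
        return "warm"
    return "busy"

print(tiny_tree_predict({"cpu_load": 0.3, "temp": 65}))  # → ok
```

In practice a library tree (e.g. one trained with scikit-learn) compiles to a similar chain of threshold checks, which is why inference stays fast on constrained devices.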
Which hyperparameter adjustment is most likely to reduce inference latency without retraining the model?
Consider what happens when you process fewer samples at once.
Lowering batch size reduces the amount of data processed at once, which can reduce latency per batch during inference.
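The effect can be sketched with the same linear-cost simulation used earlier (the base time per sample is an assumption for illustration):

```python
import time

def simulate_batch_latency(batch_size, base_time=0.01):
    # Same linear model as the earlier simulation: cost grows with
    # the number of samples processed together.
    time.sleep(base_time * batch_size)
    return base_time * batch_size

# Per-batch latency drops when fewer samples are processed at once,
# even though the cost per sample is unchanged.
print("batch of 32:", simulate_batch_latency(32), "s")
print("batch of 1: ", simulate_batch_latency(1), "s")
```

Note the trade-off: smaller batches lower the latency of each individual response, but they can reduce overall throughput when many samples must be served.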
Given the code below, which line is the main cause of increased latency during inference?
def predict(model, data):
    results = []
    for sample in data:
        processed = preprocess(sample)
        output = model(processed)
        results.append(output)
    return results

# preprocess is slow due to heavy image resizing
# model is optimized and fast
Focus on which step is described as slow.
The call to preprocess(sample) is the bottleneck: the comments state that preprocessing is slow due to heavy image resizing, while the model itself is optimized and fast.
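A bottleneck like this can be confirmed by timing each step separately. The sketch below uses hypothetical stand-ins for the slow preprocessing and the fast model, since the originals are not shown:

```python
import time

def slow_preprocess(sample):
    # Stand-in for heavy image resizing (hypothetical helper).
    time.sleep(0.02)
    return sample

def fast_model(x):
    # Stand-in for an optimized, fast model.
    return x * 2

def predict_timed(data):
    prep_total = model_total = 0.0
    results = []
    for sample in data:
        t0 = time.perf_counter()
        processed = slow_preprocess(sample)
        t1 = time.perf_counter()
        output = fast_model(processed)
        t2 = time.perf_counter()
        prep_total += t1 - t0
        model_total += t2 - t1
        results.append(output)
    return results, prep_total, model_total

results, prep_total, model_total = predict_timed([1, 2, 3])
print(f"preprocess: {prep_total:.3f}s, model: {model_total:.3f}s")
```

Once timing confirms where the latency comes from, remedies such as cheaper resizing, caching preprocessed inputs, or moving preprocessing off the request path can be targeted at the right step.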