For PII detection, recall is especially important because we want to find as many personal details as possible to protect privacy. A missed PII item means sensitive data can leak. Precision also matters because marking too many words as PII causes unnecessary redaction, which makes the text hard to read. We therefore balance both using the F1 score, the harmonic mean of precision and recall, which combines them into one number.
PII detection and redaction in Prompt Engineering / GenAI - Model Metrics & Evaluation
| | Predicted PII | Predicted Non-PII |
|---|---------------|-------------------|
| Actual PII | True Positive (TP) | False Negative (FN) |
| Actual Non-PII | False Positive (FP) | True Negative (TN) |
Example:
TP = 80 (correctly found PII)
FP = 10 (wrongly marked non-PII as PII)
FN = 20 (missed PII)
TN = 890 (correctly ignored non-PII)
Total samples = 80 + 10 + 20 + 890 = 1000
From this, we calculate:
- Precision = 80 / (80 + 10) = 0.89
- Recall = 80 / (80 + 20) = 0.80
- F1 score = 2 * (0.89 * 0.80) / (0.89 + 0.80) ≈ 0.84
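The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `precision_recall_f1` is made up for illustration and is not part of any library.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    # Guard against division by zero when nothing is predicted or nothing is present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts from the example: TP = 80, FP = 10, FN = 20
precision, recall, f1 = precision_recall_f1(tp=80, fp=10, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.89 recall=0.80 f1=0.84
```

TN does not appear in any of the three formulas, which is why these metrics stay informative even when non-PII text dominates.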
If we focus too much on precision, we only mark PII when very sure. This means fewer false alarms but we might miss some PII (low recall). For example, a system that only redacts very obvious phone numbers but misses nicknames or emails.
If we focus too much on recall, we catch almost all PII but also mark many normal words as PII (low precision). This makes the text hard to read because too many words are redacted.
Good PII detection balances both. For example, a system that finds 90% of PII (high recall) and keeps false alarms below 10% (high precision).
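One way to see this tradeoff is to vary the confidence threshold of a detector. The sketch below assumes a detector that assigns each candidate span a score between 0 and 1; the scores and labels are invented purely for illustration, and `precision_recall_at` is a hypothetical helper.

```python
# Hypothetical (confidence score, is actually PII) pairs, invented for illustration.
scored = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False), (0.60, True),
    (0.40, False), (0.30, True), (0.20, False), (0.10, False), (0.05, False),
]


def precision_recall_at(scored, threshold):
    """Treat every span scoring at or above the threshold as predicted PII."""
    tp = sum(1 for score, is_pii in scored if score >= threshold and is_pii)
    fp = sum(1 for score, is_pii in scored if score >= threshold and not is_pii)
    fn = sum(1 for score, is_pii in scored if score < threshold and is_pii)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.85, 0.50, 0.25):
    p, r = precision_recall_at(scored, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
# threshold=0.85 precision=1.00 recall=0.40  (very sure, misses most PII)
# threshold=0.50 precision=0.80 recall=0.80  (balanced)
# threshold=0.25 precision=0.71 recall=1.00  (catches everything, more false alarms)
```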
- Good: Precision ≥ 0.85, Recall ≥ 0.85, F1 ≥ 0.85. This means most PII is found and few false redactions.
- Bad: Precision < 0.5 or Recall < 0.5. This means many false alarms or many missed PII, both harmful.
- Accuracy is less useful here because most text is non-PII, so a model that marks nothing can have high accuracy but is useless.
- Accuracy paradox: Since most text is non-PII, a model that never detects PII can have high accuracy but zero recall.
- Data leakage: If test data contains PII seen during training, metrics look better but model fails on new data.
- Overfitting: Model memorizes specific PII patterns but misses new types, causing low recall in real use.
- Ignoring context: Some words are PII only in certain contexts; metrics must consider this to avoid false positives.
Your PII detection model has 98% accuracy but only 12% recall on PII. Is it good for production? Why or why not?
Answer: No, it is not good. The high accuracy is misleading because most text is non-PII. The very low recall means it misses 88% of PII, risking privacy leaks. For PII detection, recall must be high to protect sensitive data.
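To make the answer concrete, here is one set of counts that produces exactly these numbers. The counts are hypothetical (the question does not give them); they are chosen only so that accuracy is 98% and recall is 12%.

```python
# Hypothetical counts for 10,000 tokens, of which only 200 are PII.
tp, fn = 24, 176      # PII found vs. PII missed
fp, tn = 24, 9776     # non-PII wrongly flagged vs. correctly ignored

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2%} recall={recall:.2%} missed_pii={fn}")
# prints: accuracy=98.00% recall=12.00% missed_pii=176
```

Because 98% of the tokens are non-PII, the large TN count dominates accuracy and hides the 176 missed items.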
Practice
Solution
Step 1: Understand PII detection
PII detection is about finding personal information like names, emails, or phone numbers in text.
Step 2: Identify the purpose
The goal is to protect privacy by recognizing sensitive data that should not be shared openly.
Final Answer:
To find personal information to protect privacy -> Option C
Quick Check:
PII detection = find personal info [OK]
- Confusing PII detection with data translation
- Thinking it speeds up processing
- Believing it increases dataset size
Solution
Step 1: Understand redaction
Redaction means hiding sensitive info by replacing it with a placeholder, not deleting or changing it randomly.
Step 2: Choose the correct method
Replacing the email with a clear placeholder like <EMAIL_REDACTED> keeps the text readable and safe.
Final Answer:
Replace the email with <EMAIL_REDACTED> -> Option A
Quick Check:
Redaction = replace sensitive info with placeholder [OK]
- Deleting whole sentences instead of redacting
- Replacing emails with unrelated data
- Highlighting instead of hiding
import re
text = 'Contact me at john.doe@example.com or 123-456-7890.'
redacted = re.sub(r'\S+@\S+\.\S+', '<EMAIL_REDACTED>', text)
print(redacted)
What will be the output?
Solution
Step 1: Understand the regex pattern
The pattern '\S+@\S+\.\S+' matches email addresses (non-space chars @ non-space chars . non-space chars).
Step 2: Apply substitution
The code replaces the email with '<EMAIL_REDACTED>' but leaves the phone number unchanged.
Final Answer:
Contact me at <EMAIL_REDACTED> or 123-456-7890. -> Option D
Quick Check:
Email replaced, phone unchanged = Contact me at <EMAIL_REDACTED> or 123-456-7890. [OK]
- Thinking phone number is replaced
- Misreading regex pattern
- Assuming no replacement happens
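One caveat about the pattern in this question: `\S+` is greedy, so when an email ends a sentence the trailing period is swallowed into the match. The sketch below contrasts it with a somewhat stricter pattern. The stricter pattern is an assumption chosen for illustration, not a complete email grammar, and the sample sentence is made up.

```python
import re

text = 'Email john.doe@example.com.'

# Pattern from the question: any run of non-space characters around '@' and '.'.
loose = re.sub(r'\S+@\S+\.\S+', '<EMAIL_REDACTED>', text)

# Illustrative stricter pattern: the domain must end in a word character or
# hyphen, so sentence punctuation after the address is left in place.
strict = re.sub(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', '<EMAIL_REDACTED>', text)

print(loose)   # Email <EMAIL_REDACTED>
print(strict)  # Email <EMAIL_REDACTED>.
```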
import re
text = 'Call 555-1234 or 555-5678.'
redacted = re.sub(r'\d{3}-\d{4}', '<PHONE_REDACTED>', text)
print(redacted)
But the output is:
'Call 555-1234 or 555-5678.'
What is the likely error?
Solution
Step 1: Check regex pattern against phone format
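The pattern '\d{3}-\d{4}' matches only seven-digit numbers written exactly like '555-1234'. Run on the text exactly as shown, it would redact both numbers, so unchanged output indicates that the real input is formatted differently, for example with spaces, dots, or area codes.
Step 2: Confirm if pattern matches text
If the phone numbers in the actual text deviate from that format, the pattern finds no match and re.sub returns the text unchanged.
Final Answer:
The regex pattern is incorrect and does not match the phone numbers -> Option A
Quick Check: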
Regex mismatch causes no replacement [OK]
- Assuming re.sub syntax error
- Forgetting parentheses in print (Python 3+)
- Thinking text is empty without checking
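A pattern that tolerates a few common separators is less likely to fail silently. The following is a rough sketch under the assumption that numbers are US-style with an optional area code; the pattern and sample text are invented for illustration, and a pattern this permissive can also flag other seven-digit numbers.

```python
import re

text = 'Call 555-1234, 555 5678, or (555) 123-4567.'

# Strict pattern from the question: exactly three digits, a hyphen, four digits.
strict = re.sub(r'\d{3}-\d{4}', '<PHONE_REDACTED>', text)

# Illustrative flexible pattern: optional area code (with or without
# parentheses) and an optional space, dot, or hyphen between digit groups.
PHONE = re.compile(r'(?:\(?\d{3}\)?[\s.-]?)?\d{3}[\s.-]?\d{4}')
flexible = PHONE.sub('<PHONE_REDACTED>', text)

print(strict)    # Call <PHONE_REDACTED>, 555 5678, or (555) <PHONE_REDACTED>.
print(flexible)  # Call <PHONE_REDACTED>, <PHONE_REDACTED>, or <PHONE_REDACTED>.
```

The strict version leaves '555 5678' and the area code '(555)' exposed, which is the kind of partial miss that shows up as low recall.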
Solution
Step 1: Understand regex for emails and phones
The email pattern '\S+@\S+\.\S+' matches emails; '\d{3}-\d{3}-\d{4}' matches US phone numbers like '123-456-7890'.
Step 2: Combine patterns with OR operator
Using '|' between patterns matches either emails or phone numbers separately.
Step 3: Evaluate options
r'\S+@\S+\.\S+|\d{3}-\d{3}-\d{4}' correctly uses '|' to combine the patterns. r'\d{3}-\d{4}|\S+@\S+\.\S+' also uses '|', and the reversed order is harmless, but its phone part covers only seven digits, so '123-456-7890' would be redacted only partially. r'\S+@\S+\.\S+\d{3}-\d{3}-\d{4}' concatenates the patterns, so it matches only an email immediately followed by a phone number (wrong). r'\S+@\S+\.\S+&\d{3}-\d{3}-\d{4}' uses '&', which is not an operator in regex and is matched as a literal character (wrong).
Final Answer:
r'\S+@\S+\.\S+|\d{3}-\d{3}-\d{4}' -> Option B
Quick Check:
Use '|' to combine regex patterns [OK]
- Concatenating patterns without '|'
- Using invalid regex operators like '&'
- Mixing order but forgetting OR operator
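The combined pattern can be sketched with named groups so that each match gets a placeholder for its own type. The group names `EMAIL` and `PHONE` and the helper `redact` are chosen here for illustration; the sub-patterns are the ones from the solution above.

```python
import re

# Same alternation as the correct option, with each side wrapped in a named group.
PII_PATTERN = re.compile(
    r'(?P<EMAIL>\S+@\S+\.\S+)|(?P<PHONE>\d{3}-\d{3}-\d{4})'
)


def redact(text):
    """Replace each match with a placeholder named after the group that matched."""
    def placeholder(match):
        # match.lastgroup is the name of the group that matched: 'EMAIL' or 'PHONE'.
        return f'<{match.lastgroup}_REDACTED>'
    return PII_PATTERN.sub(placeholder, text)


print(redact('Contact me at john.doe@example.com or 123-456-7890.'))
# prints: Contact me at <EMAIL_REDACTED> or <PHONE_REDACTED>.
```

Passing a function to `sub` instead of a fixed string keeps one pass over the text while still telling the reader what kind of data was removed.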
