Practice


1. Why do we preprocess raw text before using it in machine learning models?

easy

A. To make the text longer and more complex

B. To add more punctuation for clarity

C. To remove noise like punctuation and extra spaces

D. To change the meaning of the text

Solution

Step 1: Understand the purpose of preprocessing
Preprocessing cleans raw text by removing unwanted parts like punctuation and extra spaces.
Step 2: Connect cleaning to model quality
Clean, consistent text gives a machine learning model fewer spurious variations of the same word to deal with, which helps it learn from the data and perform well.
Final Answer:
To remove noise like punctuation and extra spaces -> Option C
Quick Check:
Preprocessing removes noise = C [OK]

Hint: Preprocessing cleans text by removing noise [OK]

Common Mistakes:

Thinking preprocessing adds complexity
Believing preprocessing changes text meaning
Assuming punctuation is always helpful
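
The idea of removing noise can be sketched in a few lines. This is a minimal illustration, not a complete preprocessing routine: the helper name remove_noise and the sample sentence are made up for this example, and it only handles ASCII punctuation as listed in string.punctuation.

```python
import string


def remove_noise(text: str) -> str:
    """Drop ASCII punctuation and collapse runs of whitespace (hypothetical helper)."""
    # A translation table with a third argument deletes every listed character.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments splits on any whitespace run and drops empty pieces.
    return " ".join(no_punct.split())


raw = "  Great   product!!!  Would buy again...  "
cleaned = remove_noise(raw)
print(repr(cleaned))  # 'Great product Would buy again'
```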

2. Which of the following is the correct way to convert all text to lowercase in Python preprocessing?

easy

A. text = text.lower()

B. text = text.capitalize()

C. text = text.upper()

D. text = text.title()

Solution

Step 1: Identify the method for lowercase conversion
Python's lower() method converts all characters in a string to lowercase.
Step 2: Compare with other methods
upper() makes all text uppercase, capitalize() uppercases only the first character and lowercases the rest, and title() capitalizes the first letter of each word.
Final Answer:
text = text.lower() -> Option A
Quick Check:
Lowercase method = lower() = A [OK]

Hint: Use .lower() to convert text to lowercase [OK]

Common Mistakes:

Using upper() instead of lower()
Confusing capitalize() with lower()
Using title() which changes word capitalization
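
To see the four methods side by side, here is a short comparison on one mixed-case string. The sample text is an assumption chosen only to make the differences visible.

```python
text = "hELLO wORLD"  # sample input chosen for illustration

results = {
    "lower": text.lower(),            # every character lowercased
    "upper": text.upper(),            # every character uppercased
    "capitalize": text.capitalize(),  # first character upper, the rest lower
    "title": text.title(),            # first letter of each word upper
}

for name, value in results.items():
    print(f"{name:>10}: {value}")
```

Only lower() gives a fully lowercase string, which is what case normalization needs.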

3. What will be the output of this Python code snippet for preprocessing?

text = "Hello, World!  "
clean_text = text.strip().lower().replace(',', '')
print(clean_text)

medium

A. "hello, world!"

B. "hello world"

C. "Hello, World!"

D. "hello world!"

Solution

Step 1: Apply strip() and lower()
strip() removes whitespace at both ends and lower() converts to lowercase, so "Hello, World!  " becomes "hello, world!"
Step 2: Replace comma with empty string
replace(',', '') removes the comma, resulting in "hello world!"
Final Answer:
"hello world!" -> Option D
Quick Check:
strip + lower + replace comma = "hello world!" = D [OK]

Hint: Apply strip, lower, then replace to clean text [OK]

Common Mistakes:

Forgetting strip() removes spaces
Not removing comma correctly
Confusing case conversion order
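
The chained call in this question can be split into separate steps to check each intermediate value. The variable names step1 to step3 are introduced here only for the trace.

```python
text = "Hello, World!  "

step1 = text.strip()            # trailing spaces removed
step2 = step1.lower()           # all characters lowercased
step3 = step2.replace(",", "")  # comma removed, exclamation mark kept

# repr() shows the quotes, so leftover whitespace would be visible.
for label, value in [("strip", step1), ("lower", step2), ("replace", step3)]:
    print(f"{label:>8}: {value!r}")
```

The exclamation mark survives because replace() is only asked to remove the comma.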

4. Identify the error in this preprocessing code snippet:

text = "Example Text!"
clean_text = text.lower().strip().remove('!')
print(clean_text)

medium

A. remove() is not a string method

B. strip() should be called before lower()

C. lower() does not change the text

D. print() is missing parentheses

Solution

Step 1: Check string methods used
Python strings do not have a remove() method, so calling it raises an AttributeError; to remove characters, replace() should be used.
Step 2: Verify other method usage
strip() and lower() are valid and order is acceptable; print() has parentheses.
Final Answer:
remove() is not a string method -> Option A
Quick Check:
remove() invalid for strings = A [OK]

Hint: Use replace() to remove chars, not remove() [OK]

Common Mistakes:

Using remove() instead of replace()
Thinking strip() must come before lower()
Assuming print() is missing parentheses when it is not
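
A quick way to confirm the error and the fix is to run both versions. This is a small sketch: the try/except wrapper and the names error_message and fixed are added here for demonstration and are not part of the original snippet.

```python
text = "Example Text!"

try:
    # str has no remove() method, so this line raises AttributeError.
    clean_text = text.lower().strip().remove("!")
except AttributeError as exc:
    error_message = str(exc)
    print("Error:", error_message)

# replace() with an empty string removes the character instead.
fixed = text.lower().strip().replace("!", "")
print(fixed)  # example text
```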

5. You have a dataset with inconsistent casing, extra spaces, and punctuation. Which sequence of preprocessing steps best cleans the text for a machine learning model?

hard

A. Convert to lowercase, strip spaces, remove punctuation

B. Strip spaces, remove punctuation, convert to lowercase

C. Remove punctuation, convert to lowercase, strip spaces

D. Remove punctuation, strip spaces, convert to uppercase

Solution

Step 1: Start by removing extra spaces
Stripping spaces first cleans the text edges, making punctuation removal accurate.
Step 2: Remove punctuation and convert to lowercase
Removing punctuation after stripping works on text whose edges are already clean; converting to lowercase last ensures uniform casing.
Final Answer:
Strip spaces, remove punctuation, convert to lowercase -> Option B
Quick Check:
Clean edges, remove noise, unify case = B [OK]

Hint: Strip spaces first, then remove punctuation, then lowercase [OK]

Common Mistakes:

Changing case before removing spaces
Removing punctuation before stripping spaces
Converting to uppercase instead of lowercase
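
The sequence from the answer (strip, remove punctuation, lowercase) could be sketched as a small function. The function name clean and the sample strings are assumptions for illustration, and punctuation here means the ASCII characters in string.punctuation only.

```python
import string

# Deletion table built once and reused for every string.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(text: str) -> str:
    """Apply the three steps in the order given in the answer (hypothetical helper)."""
    text = text.strip()                  # 1. strip spaces at the edges
    text = text.translate(_PUNCT_TABLE)  # 2. remove punctuation
    text = text.lower()                  # 3. convert to lowercase
    return text


samples = ["  Hello, World!  ", "NLP is FUN...", "  Mixed CASE, extra spaces!  "]
cleaned = [clean(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```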
