import string

sentences = ["AI is fun!", "Fun with AI and machine learning.", "Learning AI is exciting."]

vocab = set()
for sent in sentences:
    sent = sent.lower()
    sent = sent.translate(str.maketrans('', '', string.punctuation))
    tokens = sent.split()
    vocab.update(tokens)

print(len(vocab))

Practice


1. What is the main purpose of a text preprocessing pipeline in NLP?

easy

A. To train the machine learning model directly

B. To generate new text data automatically

C. To clean and prepare text data step-by-step for models

D. To visualize text data in graphs

Solution

Step 1: Understand the role of preprocessing
Preprocessing cleans and prepares raw text so models can understand it better.
Step 2: Identify pipeline benefits
Pipelines organize these steps neatly and make the process repeatable.
Final Answer:
To clean and prepare text data step-by-step for models -> Option C
Quick Check:
Preprocessing pipeline = clean and prepare text [OK]

Hint: Pipelines organize cleaning steps before modeling [OK]

Common Mistakes:

Confusing preprocessing with model training
Thinking pipelines generate new text
Assuming pipelines visualize data

2. Which of the following is the correct way to chain text preprocessing steps in Python using a pipeline?

easy

A. pipeline = [tokenize, lowercase, remove_stopwords]

B. pipeline = Pipeline(steps=[('tokenize', tokenize), ('lowercase', lowercase), ('stop', remove_stopwords)])

C. pipeline = tokenize + lowercase + remove_stopwords

D. pipeline = tokenize.lowercase.remove_stopwords()

Solution

Step 1: Recognize pipeline syntax
In Python, pipelines are often built with a Pipeline class (scikit-learn provides one) that takes a list of named steps.
Step 2: Check options
Option B passes steps as a list of (name, step) tuples, which is the form a Pipeline class expects. Note that scikit-learn's Pipeline requires each step to be a transformer object, so plain functions would first be wrapped, for example with FunctionTransformer.
Final Answer:
pipeline = Pipeline(steps=[('tokenize', tokenize), ('lowercase', lowercase), ('stop', remove_stopwords)]) -> Option B
Quick Check:
Pipeline uses steps list with (name, function) tuples [OK]

Hint: Use Pipeline class with named steps list [OK]

Common Mistakes:

Trying to chain functions with dots or plus signs
Not naming steps in the pipeline
Using list of functions without Pipeline wrapper
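
The named-step structure in option B can be sketched without any external library. The Pipeline class below is a hypothetical, minimal stand-in written for illustration only; it is not scikit-learn's Pipeline. The three step functions and the toy stopword list are also assumptions.

```python
# Hypothetical minimal Pipeline: an illustration, not a library class.
class Pipeline:
    def __init__(self, steps):
        # steps is a list of (name, callable) tuples, applied in order
        self.steps = steps

    def run(self, data):
        for name, func in self.steps:
            data = func(data)
        return data


def tokenize(text):
    return text.split()


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    stopwords = {"the", "is", "at"}  # assumed toy stopword list
    return [t for t in tokens if t not in stopwords]


pipeline = Pipeline(steps=[
    ("tokenize", tokenize),
    ("lowercase", lowercase),
    ("stop", remove_stopwords),
])

result = pipeline.run("The cat is at the door")
print(result)  # ['cat', 'door']
```

Naming each step makes it easier to inspect or swap a single stage later, which a bare list of functions does not offer.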

3. Given the following code snippet, what will be the output of processed_text?

def lowercase(text):
    return text.lower()

def remove_punctuation(text):
    return ''.join(c for c in text if c.isalnum() or c.isspace())

text = "Hello, World!"

pipeline = [lowercase, remove_punctuation]

processed_text = text
for step in pipeline:
    processed_text = step(processed_text)

print(processed_text)

medium

A. hello world

B. Hello World

C. hello, world!

D. HELLO WORLD

Solution

Step 1: Apply lowercase function
"Hello, World!" becomes "hello, world!" after lowercase.
Step 2: Apply remove_punctuation function
Removes commas and exclamation marks, leaving "hello world".
Final Answer:
hello world -> Option A
Quick Check:
Lowercase + remove punctuation = "hello world" [OK]

Hint: Apply steps one by one on text [OK]

Common Mistakes:

Forgetting to lowercase before removing punctuation
Assuming punctuation remains
Confusing case sensitivity
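
To see how the text changes after each step, the loop from the question can be extended to record intermediate results. This is a small sketch; run_with_trace is a hypothetical helper that is not part of the original snippet.

```python
def lowercase(text):
    return text.lower()


def remove_punctuation(text):
    # Keep only letters, digits, and whitespace
    return ''.join(c for c in text if c.isalnum() or c.isspace())


def run_with_trace(text, steps):
    """Apply each step in order and record (step name, result) pairs."""
    trace = []
    for step in steps:
        text = step(text)
        trace.append((step.__name__, text))
    return text, trace


final, trace = run_with_trace("Hello, World!", [lowercase, remove_punctuation])
for name, value in trace:
    print(f"{name}: {value!r}")
# lowercase: 'hello, world!'
# remove_punctuation: 'hello world'
```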

4. Identify the error in this text preprocessing pipeline code and select the fix:

def tokenize(text):
    return text.split()

def remove_stopwords(words):
    stopwords = ['the', 'is', 'at']
    return [w for w in words if w not in stopwords]

text = "The cat is at the door"

pipeline = [tokenize, remove_stopwords]

processed = text
for step in pipeline:
    processed = step(processed)

print(processed)

medium

A. Define stopwords outside the function

B. Add join after remove_stopwords to convert list back to string

C. Replace split() with list() in tokenize

D. Change text to lowercase before tokenizing

Solution

Step 1: Analyze stopwords matching
The stopword list is lowercase, but the input starts with a capitalized "The", so that word does not match 'the' and stays in the output: ['The', 'cat', 'door'].
Step 2: Fix by lowercasing text before tokenizing
Lowercasing ensures stopwords match and are removed correctly.
Final Answer:
Change text to lowercase before tokenizing -> Option D
Quick Check:
Lowercase text first to match stopwords [OK]

Hint: Lowercase text before removing stopwords [OK]

Common Mistakes:

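Ignoring the case mismatch between the text and the stopword list
Joining the list back into a string when it is not needed
Replacing split() with list(), which splits the text into characters instead of words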
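
The fix from option D can be checked by running the pipeline with and without a lowercase step. This is a minimal sketch; the lowercase function and the run helper are assumptions added for the comparison, while tokenize and remove_stopwords are taken from the question.

```python
def lowercase(text):
    return text.lower()


def tokenize(text):
    return text.split()


def remove_stopwords(words):
    stopwords = ['the', 'is', 'at']
    return [w for w in words if w not in stopwords]


def run(text, pipeline):
    # Hypothetical helper: apply each step in order
    processed = text
    for step in pipeline:
        processed = step(processed)
    return processed


text = "The cat is at the door"

before = run(text, [tokenize, remove_stopwords])
after = run(text, [lowercase, tokenize, remove_stopwords])

print(before)  # ['The', 'cat', 'door']
print(after)   # ['cat', 'door']
```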

5. You want to build a text preprocessing pipeline that:

   1. Converts text to lowercase
   2. Removes punctuation
   3. Tokenizes text into words
   4. Removes stopwords

Which of the following pipeline orders is correct to ensure proper processing?

hard

A. Lowercase -> Remove punctuation -> Tokenize -> Remove stopwords

B. Tokenize -> Lowercase -> Remove stopwords -> Remove punctuation

C. Remove stopwords -> Tokenize -> Lowercase -> Remove punctuation

D. Remove punctuation -> Remove stopwords -> Tokenize -> Lowercase

Solution

Step 1: Start with lowercase
Lowercasing first ensures uniform text for all later steps.
Step 2: Remove punctuation before tokenizing
Removing punctuation cleans text so tokens are words only.
Step 3: Tokenize then remove stopwords
Tokenizing splits text into words, then stopwords can be removed from tokens.
Final Answer:
Lowercase -> Remove punctuation -> Tokenize -> Remove stopwords -> Option A
Quick Check:
Correct pipeline order = A [OK]

Hint: Lowercase, clean, tokenize, then filter stopwords [OK]

Common Mistakes:

Tokenizing before cleaning punctuation
Removing stopwords before tokenizing
Not lowercasing first
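
The order in option A can be sketched as four small functions applied in sequence. The function names, the sample sentence, and the toy stopword list are assumptions chosen for illustration. The last lines show what happens when lowercasing and punctuation removal are skipped before filtering.

```python
import string


def lowercase(text):
    return text.lower()


def remove_punctuation(text):
    # Delete every character listed in string.punctuation
    return text.translate(str.maketrans('', '', string.punctuation))


def tokenize(text):
    return text.split()


def remove_stopwords(tokens):
    stopwords = {"the", "is", "at"}  # assumed toy stopword list
    return [t for t in tokens if t not in stopwords]


def preprocess(text):
    # Option A: lowercase -> remove punctuation -> tokenize -> remove stopwords
    for step in (lowercase, remove_punctuation, tokenize, remove_stopwords):
        text = step(text)
    return text


text = "Is the cat at the door? The door is open!"

clean = preprocess(text)
print(clean)  # ['cat', 'door', 'door', 'open']

# Skipping the first two steps leaves capitalized stopwords and punctuation behind
naive = remove_stopwords(tokenize(text))
print(naive)  # ['Is', 'cat', 'door?', 'The', 'door', 'open!']
```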

Text preprocessing pipelines in NLP - Practice Problems & Coding Challenges
