One-hot encoding for text in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of one-hot encoding a small text corpus
What is the output of the following code that one-hot encodes a list of words?
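from sklearn.preprocessing import OneHotEncoder
import numpy as np

# One column of categorical data: each row holds a single word.
words = np.array([['cat'], ['dog'], ['cat'], ['bird']])
# sparse_output=False returns a dense array (scikit-learn >= 1.2; older releases name this parameter `sparse`).
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(words)
print(encoded)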
A
[[0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
B
[[1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
C
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
D
[[1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]
💡 Hint
Remember that OneHotEncoder sorts the unique words and assigns columns in that sorted (here, alphabetical) order.
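The ordering rule in the hint can be imitated in plain Python. This is only a minimal sketch, assuming that sorted unique words define the column order; it mirrors the behaviour the hint describes and is not scikit-learn's implementation. The helper name `column_order` and the word list are hypothetical and deliberately differ from the quiz data.

```python
def column_order(words):
    """Map each unique word to a column index, in sorted order."""
    return {word: idx for idx, word in enumerate(sorted(set(words)))}

# Hypothetical corpus, not the one used in the question.
words = ["pear", "apple", "fig", "apple"]
columns = column_order(words)
print(columns)  # {'apple': 0, 'fig': 1, 'pear': 2}

# Each row has a single 1 in the column assigned to its word.
encoded = [
    [1 if columns[w] == j else 0 for j in range(len(columns))]
    for w in words
]
print(encoded)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

Note that the first row belongs to "pear", whose column comes last, because columns follow sorted order rather than order of appearance.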
🧠 Conceptual
intermediate
Understanding one-hot encoding vocabulary size
If you one-hot encode a text corpus with 10,000 unique words, what will be the size of each one-hot vector?
A. A vector of length 1 with the index of the word
B. 10,000 elements with multiple elements set to 1 depending on word frequency
C. A vector of length equal to the number of words in the sentence
D. 10,000 elements with exactly one element set to 1 and the rest 0
💡 Hint
One-hot encoding creates a vector with one position for each unique word.
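To make the hint concrete, here is a small sketch on a toy corpus. It assumes a vocabulary built from unique tokens in first-seen order; the helper names `build_vocab` and `one_hot` and the sample sentence are hypothetical.

```python
def build_vocab(tokens):
    """Return the unique tokens in first-seen order."""
    return list(dict.fromkeys(tokens))

def one_hot(word, vocab):
    """Return a list with a 1 at the word's vocabulary position and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

# Hypothetical toy corpus: 6 tokens but only 4 unique words.
tokens = "the cat saw the other cat".split()
vocab = build_vocab(tokens)
vector = one_hot("saw", vocab)

# Vocabulary size, vector length, and number of 1s in the vector.
print(len(vocab), len(vector), sum(vector))  # 4 4 1
```

The vector length follows the number of unique words, not the number of tokens in the sentence.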
Hyperparameter
advanced
Choosing one-hot encoding parameters for text data
Which parameter of sklearn's OneHotEncoder controls whether the output is a sparse matrix or a dense array?
A. sparse_output (named sparse before scikit-learn 1.2)
B. handle_unknown
C. categories
D. drop
💡 Hint
This parameter decides whether the output is stored in a memory-saving sparse format or as a regular dense array.
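The difference between the two output formats can be checked directly. The sketch below assumes scikit-learn and SciPy are installed. Because the keyword was renamed from `sparse` to `sparse_output` in scikit-learn 1.2, it picks whichever name the installed version accepts. The sample data is hypothetical.

```python
import inspect

from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Pick the keyword name supported by the installed scikit-learn version.
params = inspect.signature(OneHotEncoder).parameters
dense_kwarg = "sparse_output" if "sparse_output" in params else "sparse"

words = [["red"], ["green"], ["red"]]  # hypothetical data, one word per row

# Default settings return a SciPy sparse matrix.
sparse_result = OneHotEncoder().fit_transform(words)
# Switching the keyword to False returns a dense NumPy array.
dense_result = OneHotEncoder(**{dense_kwarg: False}).fit_transform(words)

print(sparse.issparse(sparse_result), sparse.issparse(dense_result))
print(dense_result.shape)
```

For large vocabularies the sparse form stores only the non-zero entries, which is why it is the default.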
Metrics
advanced
Evaluating one-hot encoded text input for a classification model
You trained a classifier on one-hot encoded text data. Which metric best measures how well the model predicts the correct class labels?
A. Perplexity
B. Accuracy
C. Mean Squared Error
D. Silhouette Score
💡 Hint
Think about classification performance metrics.
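As an illustration of a classification performance metric, the sketch below computes the share of predictions that match the true labels. The function name `accuracy` and the labels are hypothetical; in practice a library routine would normally be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels and predictions for five documents.
y_true = ["spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "ham"]
print(accuracy(y_true, y_pred))  # 0.8
```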
🔧 Debug
expert
Debugging one-hot encoding with unseen words during inference
You trained a OneHotEncoder on a training set and saved it. At inference, you try to transform new text containing words not seen during training. What error will sklearn's OneHotEncoder raise by default?
A. KeyError: word not found in vocabulary
B. TypeError: unsupported operand type(s)
C. ValueError: Found unknown categories during transform
D. IndexError: index out of range
💡 Hint
Check how OneHotEncoder handles unknown categories by default.
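The default behaviour and one common workaround can be compared side by side. This is a sketch assuming scikit-learn is installed. The training and inference words are hypothetical, and `handle_unknown="ignore"` is one possible remedy rather than the only one.

```python
from sklearn.preprocessing import OneHotEncoder

train = [["cat"], ["dog"]]   # hypothetical training words
new = [["dog"], ["fish"]]    # "fish" was never seen during fit

# Default settings: handle_unknown="error".
strict = OneHotEncoder().fit(train)
try:
    strict.transform(new)
except ValueError as exc:
    print("strict encoder failed:", exc)

# handle_unknown="ignore" encodes unseen words as an all-zero row instead.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(new).toarray()
print(encoded)
```

An all-zero row keeps inference running, but the model then receives no information about the unseen word.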

Practice

1. What does one-hot encoding do to words in text processing?
easy
A. Converts each word into a vector with one 1 and rest 0s
B. Replaces words with their synonyms
C. Counts the number of letters in each word
D. Sorts words alphabetically

Solution

  1. Step 1: Understand one-hot encoding concept

    One-hot encoding creates a vector for each word where only one position is 1 and all others are 0.
  2. Step 2: Compare options with definition

    Only option A, "Converts each word into a vector with one 1 and rest 0s", matches this definition exactly.
  3. Final Answer:

    Converts each word into a vector with one 1 and rest 0s -> Option A
  4. Quick Check:

    One-hot encoding = vector with single 1 [OK]
Hint: One-hot means one 1 in vector, rest zeros [OK]
Common Mistakes:
  • Thinking it replaces words with synonyms
  • Confusing with counting letters
  • Assuming it sorts words
2. Which of the following Python assignments gives the correct one-hot vector for the word 'cat' from vocabulary ['cat', 'dog', 'bird']?
easy
A. one_hot = [0, 0, 1]
B. one_hot = [0, 1, 0]
C. one_hot = [1, 1, 0]
D. one_hot = [1, 0, 0]

Solution

  1. Step 1: Identify the index of 'cat' in vocabulary

    'cat' is at index 0 in ['cat', 'dog', 'bird'].
  2. Step 2: Create one-hot vector with 1 at index 0

    The vector should have 1 at position 0 and 0 elsewhere: [1, 0, 0].
  3. Final Answer:

    [1, 0, 0] -> Option D
  4. Quick Check:

    Index 0 gets 1 in one-hot vector [OK]
Hint: Index of word = position of 1 in vector [OK]
Common Mistakes:
  • Putting 1 in wrong index
  • Using multiple 1s in vector
  • Confusing word order in vocabulary
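The two solution steps above (find the index, then put a 1 there) can be sketched as a small helper. The function name `one_hot_by_index` is hypothetical.

```python
def one_hot_by_index(word, vocab):
    """Place a single 1 at the word's index in the vocabulary."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1  # list.index raises ValueError for unknown words
    return vector

vocab = ["cat", "dog", "bird"]
print(one_hot_by_index("cat", vocab))   # [1, 0, 0]
print(one_hot_by_index("bird", vocab))  # [0, 0, 1]
```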
3. What will be the output of this Python code?
vocab = ['apple', 'banana', 'cherry']
word = 'banana'
one_hot = [1 if w == word else 0 for w in vocab]
print(one_hot)
medium
A. [1, 0, 0]
B. [0, 1, 0]
C. [0, 0, 1]
D. [1, 1, 0]

Solution

  1. Step 1: Understand list comprehension logic

    For each word in vocab, put 1 if it matches 'banana', else 0.
  2. Step 2: Apply to vocab list

    'apple' != 'banana' -> 0, 'banana' == 'banana' -> 1, 'cherry' != 'banana' -> 0, so [0, 1, 0].
  3. Final Answer:

    [0, 1, 0] -> Option B
  4. Quick Check:

    Only 'banana' gets 1 in vector [OK]
Hint: Check which vocab word equals target word [OK]
Common Mistakes:
  • Mixing up word positions
  • Using 1 for all words
  • Misreading list comprehension
4. Identify the error in this one-hot encoding code snippet:
vocab = ['red', 'green', 'blue']
word = 'green'
one_hot = [0 if w == word else 1 for w in vocab]
print(one_hot)
medium
A. The list comprehension syntax is invalid
B. The vocabulary list is missing a word
C. The condition is reversed; it should assign 1 when words match
D. The print statement syntax is incorrect

Solution

  1. Step 1: Analyze the list comprehension condition

    It assigns 0 if word matches, else 1, which is opposite of one-hot logic.
  2. Step 2: Correct logic for one-hot encoding

    One-hot should assign 1 when words match and 0 otherwise.
  3. Final Answer:

    The condition is reversed; it should assign 1 when words match -> Option C
  4. Quick Check:

    Match word -> 1, else 0 [OK]
Hint: One-hot sets 1 for match, not 0 [OK]
Common Mistakes:
  • Reversing 0 and 1 in condition
  • Assuming syntax error instead of logic error
  • Ignoring correct vocabulary
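For comparison, a corrected version of the snippet swaps the two values in the conditional expression. The closing sanity check is an addition for illustration and is not part of the original question.

```python
vocab = ['red', 'green', 'blue']
word = 'green'
# 1 where the vocabulary entry matches the word, 0 everywhere else.
one_hot = [1 if w == word else 0 for w in vocab]
print(one_hot)  # [0, 1, 0]

# Sanity check: a valid one-hot vector contains exactly one 1.
assert sum(one_hot) == 1
```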
5. Given a vocabulary ['sun', 'moon', 'star'] and a sentence 'moon star sun star', which one-hot encoded matrix correctly represents the sentence?
hard
A. [[0,1,0],[0,0,1],[1,0,0],[0,0,1]]
B. [[1,0,0],[0,1,0],[0,0,1],[0,1,0]]
C. [[0,0,1],[1,0,0],[0,1,0],[1,0,0]]
D. [[1,1,0],[0,0,1],[1,0,0],[0,0,1]]

Solution

  1. Step 1: Map each word to its one-hot vector

    Vocabulary indices: 'sun'->0, 'moon'->1, 'star'->2. So 'moon'=[0,1,0], 'star'=[0,0,1], 'sun'=[1,0,0].
  2. Step 2: Encode sentence words in order

    Sentence words: 'moon' -> [0,1,0], 'star' -> [0,0,1], 'sun' -> [1,0,0], 'star' -> [0,0,1].
  3. Final Answer:

    [[0,1,0],[0,0,1],[1,0,0],[0,0,1]] -> Option A
  4. Quick Check:

    Each word vector matches vocab index [OK]
Hint: Match word order and vocab index for vectors [OK]
Common Mistakes:
  • Mixing word order in sentence
  • Swapping indices of words
  • Using vectors with multiple 1s
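The mapping in the solution above can be reproduced with a short helper that encodes a whole sentence. This is a minimal sketch, assuming whitespace tokenization and that every word is in the vocabulary. The function name `encode_sentence` is hypothetical.

```python
def encode_sentence(sentence, vocab):
    """Return one one-hot row per whitespace-separated word, in sentence order."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for word in sentence.split():
        row = [0] * len(vocab)
        row[index[word]] = 1  # KeyError if the word is not in the vocabulary
        matrix.append(row)
    return matrix

vocab = ['sun', 'moon', 'star']
matrix = encode_sentence('moon star sun star', vocab)
print(matrix)  # [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
```

The matrix has one row per word in the sentence and one column per vocabulary entry, so repeated words such as 'star' produce identical rows.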