Why machines need numerical text representation in NLP - Experiment to Prove It

Experiment - Why machines need numerical text representation
Problem: We want to teach a machine to understand text. But machines only understand numbers, not words. So, we need to change text into numbers before the machine can learn from it.
Current Metrics: N/A. No model has been trained yet because the text has not been converted to numbers.
Issue: Without converting text to numbers, the machine cannot process or learn from text data.
Your Task
Convert a small set of text sentences into numerical form using a simple method, then train a basic model to classify the sentences. Show that numerical representation enables learning.
Use only basic text-to-number conversion methods (like one-hot encoding or simple token indexing).
Use a small dataset of 6 sentences with two classes.
Keep the model simple (e.g., logistic regression or a small neural network).
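
One-hot encoding, one of the basic methods mentioned above, can be sketched in plain Python. This is a minimal illustration rather than the exercise's reference solution; the helper names `build_vocab` and `one_hot` are invented for this example.

```python
# Hypothetical helpers for illustration; they do not come from any library.
def build_vocab(sentences):
    """Map each distinct lowercase word to a column index, in sorted order."""
    words = sorted({w for s in sentences for w in s.lower().split()})
    return {word: i for i, word in enumerate(words)}

def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

sentences = ["I love apples", "He hates apples"]
vocab = build_vocab(sentences)
print(vocab)  # {'apples': 0, 'hates': 1, 'he': 2, 'i': 3, 'love': 4}

love_vec = one_hot("love", vocab)
print(love_vec)  # [0, 0, 0, 0, 1]

# Summing the one-hot vectors of a sentence's words gives one vector per sentence.
word_vecs = [one_hot(w, vocab) for w in "i love apples".split()]
sentence_vec = [sum(column) for column in zip(*word_vecs)]
print(sentence_vec)  # [1, 0, 0, 1, 1]
```

The summed vector is the same kind of word-count row that CountVectorizer produces, which is why either approach satisfies the constraint.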
Solution
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample sentences and labels
sentences = [
    "I love apples",
    "You love oranges",
    "He hates apples",
    "She likes oranges",
    "Apples are tasty",
    "Oranges are sweet"
]
labels = [1, 1, 0, 1, 1, 1]  # 1 = positive about fruit, 0 = negative

# Convert text to numbers using CountVectorizer (simple word count vectors)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42)

# Train a simple logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict and evaluate
train_preds = model.predict(X_train)
test_preds = model.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)

print(f"Training accuracy: {train_acc:.2f}")
print(f"Test accuracy: {test_acc:.2f}")
Added CountVectorizer to convert text sentences into numerical vectors.
Used logistic regression to train on these numerical vectors.
Split data to check model learning on train and test sets.
Added max_iter=200 to LogisticRegression to give the solver enough iterations to converge.
Results Interpretation

Before: No numerical representation, so no model could be trained.

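After: Converting the text to numbers made it possible to train and evaluate the model, with a reported accuracy of 100% on the training and test data. With only six sentences, five of them positive, that score shows that training is now possible, not that the model generalizes.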

Machines cannot understand raw text. Converting text into numbers is essential for machines to learn from language data.
Bonus Experiment
Try using a different text representation method like TF-IDF instead of simple counts and compare the model accuracy.
💡 Hint
Use sklearn's TfidfVectorizer instead of CountVectorizer and retrain the model.
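
A minimal sketch of the swap, assuming the same six sentences as the solution. The variable names (`count_X`, `tfidf_X`, `row_norms`) are chosen for this example only, and the block stops before training so no accuracy figure is claimed.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = [
    "I love apples",
    "You love oranges",
    "He hates apples",
    "She likes oranges",
    "Apples are tasty",
    "Oranges are sweet",
]

# Same interface, different numbers: integer counts vs. weighted floats.
count_X = CountVectorizer().fit_transform(sentences)
tfidf_X = TfidfVectorizer().fit_transform(sentences)

# Both matrices have one row per sentence and one column per vocabulary word.
print(count_X.shape, tfidf_X.shape)

# With default settings, TfidfVectorizer scales every row to unit length (L2 norm).
row_norms = np.linalg.norm(tfidf_X.toarray(), axis=1)
print(row_norms.round(2))
```

The `tfidf_X` matrix can then be passed to train_test_split and LogisticRegression exactly as `X` is in the solution, so the two accuracies can be compared.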

Practice

1. Why do machines need text to be converted into numbers before learning?
easy
A. Because words are too short to process
B. Because numbers are easier to read for humans
C. Because machines only understand numbers, not words
D. Because text is always incorrect

Solution

  1. Step 1: Understand machine input requirements

    Machines process data as numbers, not as text or words.
  2. Step 2: Recognize the need for conversion

    Text must be converted into numbers so machines can analyze and learn from it.
  3. Final Answer:

    Because machines only understand numbers, not words -> Option C
  4. Quick Check:

    Text to numbers = machines understand [OK]
Hint: Machines need numbers, not words, to learn [OK]
Common Mistakes:
  • Thinking machines understand words directly
  • Confusing human readability with machine input
  • Assuming text length matters more than format
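
A small sketch of the point behind question 1: the arithmetic a model performs works on numeric features but fails on raw words. The weights and features here are hypothetical values picked for illustration.

```python
# Hypothetical model weights, chosen only for this example.
weights = [0.5, -1.0, 2.0]

# Numeric features (for example word counts): a weighted sum works.
numeric_features = [1, 0, 1]
score = sum(w * x for w, x in zip(weights, numeric_features))
print(score)  # 2.5

# Raw words: the same weighted sum cannot be computed.
raw_words = ["love", "hates", "apples"]
try:
    sum(w * x for w, x in zip(weights, raw_words))
    failed = False
except TypeError as err:
    failed = True
    print("Cannot compute on raw words:", err)
```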
2. Which of the following is a correct way to represent text numerically in Python?
easy
A. text_vector = {'word': 1, 'machine': 2}
B. text_vector = ['word', 'machine']
C. text_vector = 'word machine'
D. text_vector = 12345

Solution

  1. Step 1: Identify numerical representation

    text_vector = {'word': 1, 'machine': 2} shows a dictionary mapping words to numbers, which is a common numerical representation.
  2. Step 2: Check other options

    Options B and C are a list of words and a plain string, not numbers; option D is just a number with no relation to the text.
  3. Final Answer:

    text_vector = {'word': 1, 'machine': 2} -> Option A
  4. Quick Check:

    Mapping words to numbers = correct representation [OK]
Hint: Look for word-to-number mapping in code [OK]
Common Mistakes:
  • Choosing plain text or list as numerical representation
  • Confusing numbers unrelated to words
  • Ignoring dictionary or vector formats
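
The word-to-number mapping from question 2 can be built automatically. The sketch below is one simple way to do token indexing; `build_index` and `encode` are hypothetical helper names, and reserving 0 for unseen words is an assumption of this example, not a rule.

```python
# Hypothetical helpers for illustration; they do not come from any library.
def build_index(texts):
    """Give each new word the next free integer, keeping 0 free for unknown words."""
    index = {}
    for text in texts:
        for word in text.lower().split():
            if word not in index:
                index[word] = len(index) + 1
    return index

def encode(text, index):
    """Replace each word with its integer id (0 if the word was never seen)."""
    return [index.get(word, 0) for word in text.lower().split()]

word_index = build_index(["word machine", "machine learning"])
print(word_index)                           # {'word': 1, 'machine': 2, 'learning': 3}
print(encode("machine word", word_index))   # [2, 1]
print(encode("deep learning", word_index))  # [0, 3]
```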
3. What will be the output of this Python code snippet?
from sklearn.feature_extraction.text import CountVectorizer
texts = ['hello world', 'hello machine']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(X.toarray())
print(vectorizer.get_feature_names_out())
medium
A. [[1 0 1] [1 1 0]] and ['hello' 'machine' 'world']
B. [[1 1] [1 1]] and ['hello' 'machine' 'world']
C. [[1 1] [1 0]] and ['hello' 'world']
D. [[1 0] [0 1]] and ['machine' 'world']

Solution

  1. Step 1: Understand CountVectorizer output

    CountVectorizer creates a vocabulary sorted alphabetically: ['hello', 'machine', 'world'].
  2. Step 2: Map texts to vectors

    'hello world' maps to [1, 0, 1], 'hello machine' maps to [1, 1, 0].
  3. Final Answer:

    [[1 0 1] [1 1 0]] and ['hello' 'machine' 'world'] -> Option A
  4. Quick Check:

    Text to count vectors and vocabulary = [[1 0 1] [1 1 0]] and ['hello' 'machine' 'world'] [OK]
Hint: Vocabulary is alphabetical; counts match word presence [OK]
Common Mistakes:
  • Mixing order of vocabulary words
  • Confusing counts with binary presence
  • Misreading array shapes
4. Identify the error in this code that tries to convert text to numbers:
from sklearn.feature_extraction.text import CountVectorizer
texts = ['cat dog', 'dog mouse']
vectorizer = CountVectorizer()
X = vectorizer.transform(texts)
print(X.toarray())
medium
A. texts should be a single string, not a list
B. CountVectorizer must be fitted before transform
C. toarray() is not a valid method
D. CountVectorizer cannot handle multiple texts

Solution

  1. Step 1: Check CountVectorizer usage

    CountVectorizer requires calling fit() or fit_transform() before transform() to build vocabulary.
  2. Step 2: Identify missing step

    The code calls transform() without fitting, so no vocabulary exists yet and scikit-learn raises a NotFittedError.
  3. Final Answer:

    CountVectorizer must be fitted before transform -> Option B
  4. Quick Check:

    fit() before transform() = correct usage [OK]
Hint: Always fit before transform with CountVectorizer [OK]
Common Mistakes:
  • Skipping fit() step
  • Thinking a list of strings is the problem (a list of texts is the expected input)
  • Misunderstanding toarray() method
5. You want to prepare text data for a machine learning model. Which statement best explains why you should convert text into numbers first?
hard
A. Because text data is too large to store in memory
B. Because converting text to numbers removes spelling errors
C. Because numbers are easier for humans to read than text
D. Because numerical data allows models to calculate patterns and relationships

Solution

  1. Step 1: Understand model data needs

    Machine learning models work by finding patterns in numbers, not raw text.
  2. Step 2: Explain importance of numerical conversion

    Converting text to numbers lets models calculate similarities and differences to learn effectively.
  3. Final Answer:

    Because numerical data allows models to calculate patterns and relationships -> Option D
  4. Quick Check:

    Numbers enable pattern learning in models [OK]
Hint: Models learn patterns from numbers, not raw text [OK]
Common Mistakes:
  • Thinking conversion is for memory saving
  • Believing numbers are for human reading
  • Assuming conversion fixes spelling
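
One concrete example of "calculating similarities" is cosine similarity between word-count vectors. The sketch below uses plain Python; the vocabulary and the three vectors are hypothetical, written by hand for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical count vectors over the vocabulary ['apples', 'hates', 'love', 'oranges'].
love_apples = [1, 0, 1, 0]
love_oranges = [0, 0, 1, 1]
hates_oranges = [0, 1, 0, 1]

shared_word = cosine_similarity(love_apples, love_oranges)  # one word in common
no_overlap = cosine_similarity(love_apples, hates_oranges)  # no words in common

print(round(shared_word, 2))  # 0.5
print(round(no_overlap, 2))   # 0.0
```

Sentences that share words score higher than sentences that share none, a comparison that is only possible once the text has a numerical form.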