
Why machines need numerical text representation in NLP - Experiment to Prove It

Experiment - Why machines need numerical text representation
Problem: We want to teach a machine to understand text, but machines only understand numbers, not words. So we need to convert text into numbers before the machine can learn from it.
Current Metrics: N/A - no model trained yet, because the text has not been converted to numbers.
Issue: Without converting text to numbers, the machine cannot process or learn from text data.
Your Task
Convert a small set of text sentences into numerical form using a simple method, then train a basic model to classify the sentences. Show that numerical representation enables learning.
Use only basic text-to-number conversion methods (like one-hot encoding or simple token indexing).
Use a small dataset of 6 sentences with two classes.
Keep the model simple (e.g., logistic regression or a small neural network).
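Before reaching for a library, it helps to see what "simple token indexing" means by hand. The sketch below (an illustrative example, not part of the solution code) assigns each unique word an integer index and turns each sentence into a one-hot style bag-of-words vector; names like `vocab` and `to_vector` are our own for illustration:

```python
# Two example sentences from the dataset
sentences = ["I love apples", "He hates apples"]

# Build a vocabulary: each unique (lowercased) word gets an integer index
vocab = {}
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)

# Represent a sentence as a one-hot style bag-of-words vector:
# position i is 1 if vocabulary word i appears in the sentence
def to_vector(sentence):
    vec = [0] * len(vocab)
    for word in sentence.lower().split():
        vec[vocab[word]] = 1
    return vec

print(vocab)                        # {'i': 0, 'love': 1, 'apples': 2, 'he': 3, 'hates': 4}
print(to_vector("I love apples"))   # [1, 1, 1, 0, 0]
```

This is essentially what CountVectorizer in the solution automates (with extra features such as counting repeated words and tokenization rules).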
Solution
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample sentences and labels
sentences = [
    "I love apples",
    "You love oranges",
    "He hates apples",
    "She likes oranges",
    "Apples are tasty",
    "Oranges are sweet"
]
labels = [1, 1, 0, 1, 1, 1]  # 1 = positive about fruit, 0 = negative

# Convert text to numbers using CountVectorizer (simple word count vectors)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42)

# Train a simple logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict and evaluate
train_preds = model.predict(X_train)
test_preds = model.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)

print(f"Training accuracy: {train_acc:.2f}")
print(f"Test accuracy: {test_acc:.2f}")
Used CountVectorizer to convert the text sentences into numerical word-count vectors.
Trained a logistic regression model on these numerical vectors.
Split the data to check how well the model learns on both the train and test sets.
Set max_iter=200 in LogisticRegression to ensure convergence.
Results Interpretation

Before: No numerical representation, so no model could be trained.

After: Converting the text to numbers allowed the model to learn, reaching 100% accuracy on both the training and test data.

Machines cannot understand raw text. Converting text into numbers is essential for machines to learn from language data.
Bonus Experiment
Try using a different text representation method like TF-IDF instead of simple counts and compare the model accuracy.
💡 Hint
Use sklearn's TfidfVectorizer instead of CountVectorizer and retrain the model.
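The bonus experiment is a one-line swap. A sketch of the retrained pipeline, assuming the same six sentences, labels, and split as the solution above; TfidfVectorizer weights each word count by how rare the word is across sentences, instead of using raw counts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

sentences = [
    "I love apples",
    "You love oranges",
    "He hates apples",
    "She likes oranges",
    "Apples are tasty",
    "Oranges are sweet",
]
labels = [1, 1, 0, 1, 1, 1]  # 1 = positive about fruit, 0 = negative

# Swap CountVectorizer for TfidfVectorizer; the rest of the pipeline is unchanged
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42
)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy with TF-IDF: {test_acc:.2f}")
```

On a dataset this tiny, TF-IDF and raw counts usually give similar results; the differences show up on larger corpora where common words would otherwise dominate the count vectors.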