In a sentiment analysis pipeline, why do we perform tokenization on the input text?
Think about how a computer understands text before analyzing sentiment.
Tokenization breaks down text into smaller pieces such as words or subwords, which makes it easier for the model to process and understand the text.
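As a minimal illustration of word-level tokenization (the `tokenize` helper below is a hypothetical sketch, not any particular library's tokenizer):

```python
import re

def tokenize(text):
    # Lowercase, then split on runs of non-letter characters to get word tokens.
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]

tokens = tokenize("Tokenization breaks text into pieces!")
# tokens == ['tokenization', 'breaks', 'text', 'into', 'pieces']
```

Each token can then be mapped to an integer ID, which is the numeric form a model actually consumes.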
Given the following Python code for preprocessing text in a sentiment analysis pipeline, what is the output?
```python
import re

text = "I love this product! It's amazing."
cleaned = re.sub(r'[^a-zA-Z ]', '', text).lower().split()
print(cleaned)
```
Look at how punctuation is removed and text is converted to lowercase.
The regex removes every character except letters and spaces, so the apostrophe and punctuation disappear ("It's" becomes "Its"). The text is then lowercased and split on whitespace, giving the output ['i', 'love', 'this', 'product', 'its', 'amazing'].
You want to build a sentiment analysis model that understands the context of words in a sentence. Which model architecture is most suitable?
Think about models that can remember previous words to understand meaning.
RNNs like LSTM or GRU can capture the order and context of words, which is important for understanding sentiment in sentences.
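A minimal PyTorch sketch of such a model (all sizes here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LSTMSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)          # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)  # hidden: (num_layers, batch, hidden_dim)
        return self.fc(hidden[-1])            # final hidden state -> class logits

model = LSTMSentiment(vocab_size=1000, embed_dim=50, hidden_dim=64, num_classes=2)
logits = model(torch.randint(0, 1000, (4, 12)))  # batch of 4 sequences of length 12
# logits.shape == (4, 2)
```

Unlike mean-pooling over embeddings, the LSTM's hidden state depends on word order, so "not good" and "good, not" produce different representations.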
Your sentiment dataset has many more positive reviews than negative ones. Which evaluation metric should you prioritize?
Consider a metric that balances precision and recall.
F1-score balances precision and recall, making it better for imbalanced datasets where accuracy can be misleading.
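A small sketch of computing F1 by hand (the `f1_score` helper and the toy labels are illustrative) shows why accuracy misleads on imbalanced data:

```python
def f1_score(y_true, y_pred, positive=1):
    # Precision and recall for the chosen class, combined into F1.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Imbalanced toy data: 8 positive reviews, 2 negative; classifier always predicts positive.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10
# Accuracy is 0.8, yet F1 for the minority (negative) class is 0.0.
```

A high accuracy here hides that the model never identifies a negative review, which F1 on the minority class exposes immediately.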
Here is a snippet of a sentiment analysis model training code. The model always predicts the same sentiment class regardless of input. What is the most likely cause?
```python
import torch
import torch.nn as nn

class SimpleSentimentModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)   # shape: (batch_size, seq_len, embed_dim)
        pooled = embedded.mean(dim=1)  # average over seq_len
        out = self.fc(pooled)
        return out

model = SimpleSentimentModel(vocab_size=1000, embed_dim=50, num_classes=2)

# Training loop omitted for brevity
# After training, model always predicts class 0.
```
Consider the effect of unbalanced classes on model predictions.
If the training data is heavily imbalanced, the model may learn to always predict the majority class to minimize loss, resulting in poor generalization.
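One common mitigation is to weight the loss by inverse class frequency, so mistakes on the minority class cost more. A PyTorch sketch (the class counts are assumed for illustration):

```python
import torch
import torch.nn as nn

# Assumed imbalance: class 0 has 900 training examples, class 1 has 100.
counts = torch.tensor([900.0, 100.0])
weights = counts.sum() / (len(counts) * counts)  # inverse-frequency weights

# The weight argument rescales each class's contribution to the loss.
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.tensor([[2.0, -1.0], [0.5, 0.3]])
targets = torch.tensor([0, 1])
loss = criterion(logits, targets)  # minority-class errors now dominate the loss
```

With this weighting, always predicting class 0 no longer minimizes the loss, which pushes the model away from the degenerate majority-class solution.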