Which statement best describes the difference between tokenization in NLTK and spaCy?
Think about how each library approaches breaking text into pieces.
NLTK provides standalone, rule-based tokenizers such as word_tokenize, which applies regular expressions and hand-written rules following the Penn Treebank conventions. spaCy uses a single nondestructive tokenizer integrated into its pipeline: rule-based splitting combined with language-specific exception rules, which handles edge cases like contractions, URLs, and abbreviations well and keeps every token aligned to the original text.
What is the output of this code snippet?
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
Check spaCy's entity labels for organizations, geopolitical entities, and money.
spaCy labels 'Apple' as an organization (ORG), 'U.K.' as a geopolitical entity (GPE), and '$1 billion' as money (MONEY), so the code prints [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')].
You want to perform sentiment analysis on movie reviews using Hugging Face transformers. Which model is the best choice?
Look for a model fine-tuned specifically for sentiment tasks.
Option D is a DistilBERT model fine-tuned on the SST-2 sentiment dataset, making it the best choice. The others are either general-purpose language models or models not fine-tuned for sentiment.
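As a minimal sketch of this setup, the snippet below uses the well-known SST-2 DistilBERT checkpoint on the Hugging Face Hub (distilbert-base-uncased-finetuned-sst-2-english); whether this is the exact checkpoint behind "Option D" is an assumption, and running it requires the transformers library plus a one-time model download:

```python
from transformers import pipeline

# Load a sentiment-analysis pipeline backed by a DistilBERT model
# fine-tuned on SST-2 (binary POSITIVE/NEGATIVE movie-review sentiment)
sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

result = sentiment("A gripping film with superb performances.")[0]
print(result["label"], round(result["score"], 3))
```

The pipeline returns a list of dicts with a "label" (POSITIVE or NEGATIVE) and a confidence "score" per input string.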
During fine-tuning a Hugging Face transformer model, what is the most likely effect of setting the learning rate too high?
Think about what happens when updates are too large during training.
Too high a learning rate causes overly large weight updates, destabilizing training: the loss oscillates or diverges instead of decreasing smoothly.
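The effect is not transformer-specific; a toy gradient-descent run on f(w) = w**2 (gradient 2w) shows how an oversized learning rate makes each update overshoot the minimum until the iterate blows up:

```python
def descend(lr, steps=20, w=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        w -= lr * 2 * w   # gradient step: grad f(w) = 2w
    return w

# Small learning rate: |w| shrinks toward the minimum at 0
print(abs(descend(lr=0.1)))
# Oversized learning rate: each step multiplies w by (1 - 2*lr) = -2,
# so |w| doubles every step and the "loss" grows instead of shrinking
print(abs(descend(lr=1.5)))
```

The same overshooting happens along every steep direction of a transformer's loss surface, which is why a too-high learning rate shows up as a loss curve that spikes or plateaus rather than descending.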
What error or unexpected output will this code produce?
from nltk.tokenize import word_tokenize

text = "Hello, world! Let's test tokenization."
tokens = word_tokenize(text)
print(tokens[10])
Count how many tokens are produced and check the index accessed.
word_tokenize produces 9 tokens (['Hello', ',', 'world', '!', 'Let', "'s", 'test', 'tokenization', '.']), so accessing index 10 raises an IndexError.