Complete the code to load a document as a string.
with open('document.txt', 'r') as file:
    text = file.[1]()
readline() reads only one line, not the full document. readlines() returns a list of lines, not a single string. The read() method reads the entire file content as a single string, which is needed to process the whole document.
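A runnable sketch of the completed snippet; a temporary file stands in for document.txt so the example is self-contained:

```python
import os
import tempfile

# Create a stand-in for document.txt in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'document.txt')
with open(path, 'w') as f:
    f.write('First line.\nSecond line.\n')

# read() returns the whole file as one string, newlines included.
with open(path, 'r') as file:
    text = file.read()

print(repr(text))  # "'First line.\\nSecond line.\\n'"
```

By contrast, file.readlines() on the same file would return the list ['First line.\n', 'Second line.\n'].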
Complete the code to split the document text into sentences.
import nltk
nltk.download('punkt')
sentences = nltk.tokenize.[1](text)
word_tokenize splits text into words, not sentences. split() is a basic string method and won't handle sentence boundaries properly. The sent_tokenize function splits text into sentences, which is essential for sentence-level processing.
Fix the error in the code to remove stopwords from the token list.
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens = ['this', 'is', 'a', 'test']
filtered = [word for word in tokens if word [1] stop_words]
Using in keeps only the stopwords, which is the opposite of what we want. We want to keep words that are not in the stopwords set, so we use not in.
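A minimal sketch of the filtering pattern; a tiny hand-picked stopword set is used here so the example runs without downloading NLTK data (in practice you would use stopwords.words('english') as in the question):

```python
# Stand-in for set(stopwords.words('english')) -- a few common function words.
stop_words = {'this', 'is', 'a', 'the', 'of', 'and'}

tokens = ['this', 'is', 'a', 'test']

# Keep only the words that are NOT stopwords.
filtered = [word for word in tokens if word not in stop_words]
print(filtered)  # ['test']
```

Swapping not in for in would instead return ['this', 'is', 'a'], the stopwords themselves.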
Fill both blanks to create a dictionary of word counts from tokens.
word_counts = [1]()
for word in tokens:
    word_counts[word] = word_counts.get(word, [2]) + 1
Starting with list() fails because lists cannot be indexed by string keys. We start with an empty dictionary, dict(). The get method returns 0 if the word has not been counted yet, then we add 1.
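The completed counting loop, sketched with a small sample token list:

```python
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']

word_counts = dict()
for word in tokens:
    # get(word, 0) returns the current count, or 0 for a first occurrence.
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts)  # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
```

The standard library's collections.Counter(tokens) produces the same mapping in one call; the explicit loop above shows what it does under the hood.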
Fill all three blanks to create a TF-IDF vectorizer and transform documents.
from sklearn.feature_extraction.text import [1]
vectorizer = [2](stop_words='english')
X = vectorizer.[3](documents)
fit alone returns the model, not the transformed data. TfidfVectorizer converts text to TF-IDF features. We create an instance and then call fit_transform to learn the vocabulary and transform the documents in one step.