Text data often needs special handling in data science. Which reason below best explains why text data is different from numeric data?
Think about how words can have different forms and meanings compared to simple numbers.
Text data is complex because words can have multiple meanings, spellings, and forms. This makes it harder to analyze directly compared to numeric data, which is straightforward to compute with.
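A minimal sketch of this point: to a computer, surface variants of the same word are simply distinct strings, so naive equality checks treat them as unrelated values.

```python
# Four surface forms of "run" are four distinct string values.
forms = ["Run", "run", "running", "ran"]
print(len(set(forms)))   # all four are distinct
print("Run" == "run")    # case alone breaks equality
```

This is exactly the ambiguity that numeric data does not have: 3 is always equal to 3, but "Run" and "run" are not equal until you normalize them.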
What is the output of this Python code that splits a sentence into words?
sentence = "Data science is fun!"
tokens = sentence.split()
print(tokens)
Remember that split() splits by spaces and keeps punctuation attached to words.
The split() method splits the string at whitespace, producing ['Data', 'science', 'is', 'fun!']. Because split() does not remove punctuation, the exclamation mark stays attached to 'fun!'.
What is the resulting list after converting all words in this list to lowercase?
words = ['Python', 'Data', 'SCIENCE', 'Fun']
lower_words = [w.lower() for w in words]
print(lower_words)
Think about what the lower() method does to each string.
The lower() method converts all uppercase letters in each string to lowercase, so the result is ['python', 'data', 'science', 'fun'].
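A related sketch: for caseless *matching* (rather than display), str.casefold() is a more aggressive form of lowercasing that also normalizes some non-English characters, such as the German sharp s.

```python
words = ['Python', 'Data', 'SCIENCE', 'Fun']
print([w.lower() for w in words])  # ['python', 'data', 'science', 'fun']

# casefold() handles cases lower() misses, e.g. German "ß" -> "ss".
print('Straße'.casefold() == 'strasse'.casefold())  # True
```

For plain ASCII text the two behave identically; casefold() only matters once non-English text enters the pipeline.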
What is wrong with this code, which tries to remove punctuation from a text string?
import string
text = "Hello, world!"
clean_text = text.replace(string.punctuation, '')
print(clean_text)
Check what string.punctuation contains and how replace() matches its first argument.
This code raises no error at all: replace() searches for its first argument as one literal substring, and string.punctuation is the entire string of punctuation characters ('!"#$%&...'), which never appears in the text. The call therefore returns the string unchanged, and the punctuation silently survives. To remove each punctuation character individually, use str.translate() with a deletion table instead.
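A sketch of the failure and the fix side by side: the replace() call is a silent no-op, while str.translate() with a deletion table built by str.maketrans() removes every punctuation character.

```python
import string

text = "Hello, world!"

# replace() looks for the ENTIRE punctuation string as one substring;
# it never occurs in text, so text comes back unchanged.
print(text.replace(string.punctuation, ''))  # Hello, world!

# translate() with a deletion table removes each character individually.
table = str.maketrans('', '', string.punctuation)
print(text.translate(table))  # Hello world
```

The third argument to str.maketrans() lists characters to delete, which makes translate() the idiomatic one-pass way to strip punctuation in Python.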
You want to prepare customer reviews for sentiment analysis. Which approach below best handles the text data before feeding it into a machine learning model?
Think about common steps in text cleaning for machine learning.
Lowercasing, removing punctuation, tokenizing, and removing stopwords are standard steps to clean text and reduce noise before analysis.
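The steps above can be sketched as one small pipeline. The stopword set here is a tiny hand-picked stand-in for illustration; real projects usually pull a stopword list from a library such as NLTK or spaCy.

```python
import string

# Illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {'the', 'is', 'a', 'an', 'and', 'it', 'was'}

def preprocess(review):
    text = review.lower()                                   # 1. lowercase
    text = text.translate(
        str.maketrans('', '', string.punctuation))          # 2. strip punctuation
    tokens = text.split()                                   # 3. tokenize on whitespace
    return [t for t in tokens if t not in STOPWORDS]        # 4. drop stopwords

print(preprocess("The product is great, and it was delivered fast!"))
# ['product', 'great', 'delivered', 'fast']
```

Each step reduces spurious variation (case, punctuation, filler words) so the model sees 'great' and 'Great,' as the same feature.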