Recall & Review
beginner
What is the main purpose of a text cleaning pipeline?
A text cleaning pipeline prepares raw text data by removing unwanted parts and fixing errors so that the text is easier to analyze or use in models.
beginner
Name three common steps in a text cleaning pipeline.
Common steps include: 1) Lowercasing all text, 2) Removing punctuation and special characters, 3) Removing stopwords (common words like 'the', 'and').
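The three steps above can be sketched in a few lines of Python. The stopword set here is a tiny illustrative sample (real pipelines typically load a fuller list from nltk or spaCy):

```python
import re

# Tiny illustrative stopword sample, not a standard list.
STOPWORDS = {"the", "and", "a", "an", "is", "of", "to"}

def clean(text):
    text = text.lower()                       # 1) lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # 2) drop punctuation/special chars
    words = text.split()
    return [w for w in words if w not in STOPWORDS]  # 3) drop stopwords

print(clean("The cat AND the hat!"))  # ['cat', 'hat']
```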
beginner
Why do we remove stopwords in text cleaning?
Stopwords are very common words that usually do not add useful meaning for analysis, so removing them helps focus on important words.
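For example, filtering a token list against a small, illustrative stopword set:

```python
# Illustrative stopword sample; not a standard list.
STOPWORDS = {"the", "and", "is", "a"}

def remove_stopwords(tokens):
    # Compare case-insensitively so "The" is also dropped.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "pipeline", "is", "fast"]))  # ['pipeline', 'fast']
```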
intermediate
What Python library is commonly used for text cleaning tasks?
The 're' library is used for regular expressions to find and replace patterns, and 'nltk' or 'spaCy' are popular for more advanced text processing.
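A short sketch of 're' in action, removing punctuation and collapsing whitespace (the sample string is made up for illustration):

```python
import re

messy = "Hello,   world!!  Visit   now."
cleaned = re.sub(r"[^\w\s]", "", messy)         # remove punctuation
cleaned = re.sub(r"\s+", " ", cleaned).strip()  # normalize whitespace
print(cleaned)  # Hello world Visit now
```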
beginner
How does tokenization fit into a text cleaning pipeline?
Tokenization splits text into smaller pieces like words or sentences, making it easier to analyze or clean each part separately.
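A minimal word tokenizer using 're' (libraries like nltk or spaCy handle trickier cases such as abbreviations and punctuation-sensitive rules):

```python
import re

sentence = "Cleaning text is fun, isn't it?"
# Match either a contraction (word'word) or a plain word.
word_tokens = re.findall(r"\w+'\w+|\w+", sentence)
print(word_tokens)  # ['Cleaning', 'text', 'is', 'fun', "isn't", 'it']
```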
Which step is NOT typically part of a text cleaning pipeline?
Adding random characters would make the text messier, not cleaner.
What does tokenization do in text cleaning?
Tokenization breaks text into smaller pieces such as words or sentences.
Why is lowercasing text important in cleaning?
Lowercasing helps treat words with different cases as the same word.
Which Python library helps with pattern matching in text cleaning?
The 're' library is used for regular expressions to find and replace text patterns.
What are stopwords?
Stopwords are common words that usually do not add useful meaning.
Describe the main steps you would include in a text cleaning pipeline and why each is important.
Think about how each step makes text easier to analyze.
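One possible answer, sketched as code (the step order and the stopword sample are illustrative choices, not a fixed standard):

```python
import re

# Illustrative stopword sample; real pipelines use a fuller list.
STOPWORDS = {"the", "a", "an", "and", "is", "to", "of"}

def pipeline(text):
    text = text.lower()                        # normalize case
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # strip digits/punctuation
    tokens = text.split()                      # tokenize on whitespace
    return [t for t in tokens if t not in STOPWORDS]

print(pipeline("The QUICK fox: see https://example.com!"))
# ['quick', 'fox', 'see']
```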
Explain how you would use Python to clean a text dataset before analysis.
Imagine you have a messy text file and want to prepare it for counting word frequencies.
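A minimal sketch, assuming the messy text has already been read into a string (the sample input is hypothetical):

```python
import re
from collections import Counter

# Hypothetical messy input standing in for a file's contents.
raw = "Dogs bark. Dogs RUN!  dogs, cats run."

# Lowercase, then pull out alphabetic words: clean + tokenize in one pass.
words = re.findall(r"[a-z]+", raw.lower())
freq = Counter(words)
print(freq.most_common(2))  # [('dogs', 3), ('run', 2)]
```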