Recall & Review
beginner
What is the main purpose of a text cleaning pipeline?
A text cleaning pipeline prepares raw text data by removing unwanted parts and fixing errors so that the text is easier to analyze or use in models.
beginner
Name three common steps in a text cleaning pipeline.
Common steps include: 1) Lowercasing all text, 2) Removing punctuation and special characters, 3) Removing stopwords (common words like 'the', 'and').
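The three steps above can be sketched in a few lines of Python. The stopword set here is a tiny illustrative sample (real pipelines typically load a fuller list from nltk or spaCy):

```python
import re

# Tiny illustrative stopword sample, not a standard list.
STOPWORDS = {"the", "and", "a", "an", "is", "of", "to"}

def clean(text):
    text = text.lower()                       # 1) lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # 2) drop punctuation/special chars
    words = text.split()
    return [w for w in words if w not in STOPWORDS]  # 3) drop stopwords

print(clean("The cat AND the hat!"))  # ['cat', 'hat']
```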
beginner
Why do we remove stopwords in text cleaning?
Stopwords are very common words that usually do not add useful meaning for analysis, so removing them helps focus on important words.
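For example, filtering a token list against a small, illustrative stopword set:

```python
# Illustrative stopword sample; not a standard list.
STOPWORDS = {"the", "and", "is", "a"}

def remove_stopwords(tokens):
    # Compare case-insensitively so "The" is also dropped.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "pipeline", "is", "fast"]))  # ['pipeline', 'fast']
```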
intermediate
What Python library is commonly used for text cleaning tasks?
The 're' library is used for regular expressions to find and replace patterns, and 'nltk' or 'spaCy' are popular for more advanced text processing.
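A short sketch of 're' in action, removing punctuation and collapsing whitespace (the sample string is made up for illustration):

```python
import re

messy = "Hello,   world!!  Visit   now."
cleaned = re.sub(r"[^\w\s]", "", messy)         # remove punctuation
cleaned = re.sub(r"\s+", " ", cleaned).strip()  # normalize whitespace
print(cleaned)  # Hello world Visit now
```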
beginner
How does tokenization fit into a text cleaning pipeline?
Tokenization splits text into smaller pieces like words or sentences, making it easier to analyze or clean each part separately.
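A minimal word tokenizer using 're' (libraries like nltk or spaCy handle trickier cases such as abbreviations and punctuation-sensitive rules):

```python
import re

sentence = "Cleaning text is fun, isn't it?"
# Match either a contraction (word'word) or a plain word.
word_tokens = re.findall(r"\w+'\w+|\w+", sentence)
print(word_tokens)  # ['Cleaning', 'text', 'is', 'fun', "isn't", 'it']
```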
Which step is NOT typically part of a text cleaning pipeline?
Adding random characters would make the text messier, not cleaner.
What does tokenization do in text cleaning?
Tokenization breaks text into smaller pieces such as words or sentences.
Why is lowercasing text important in cleaning?
Lowercasing helps treat words with different cases as the same word.
Which Python library helps with pattern matching in text cleaning?
The 're' library is used for regular expressions to find and replace text patterns.
What are stopwords?
Stopwords are common words that usually do not add useful meaning.
Describe the main steps you would include in a text cleaning pipeline and why each is important.
Think about how each step makes text easier to analyze.
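One possible answer, sketched as code (the step order and the stopword sample are illustrative choices, not a fixed standard):

```python
import re

# Illustrative stopword sample; real pipelines use a fuller list.
STOPWORDS = {"the", "a", "an", "and", "is", "to", "of"}

def pipeline(text):
    text = text.lower()                        # normalize case
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # strip digits/punctuation
    tokens = text.split()                      # tokenize on whitespace
    return [t for t in tokens if t not in STOPWORDS]

print(pipeline("The QUICK fox: see https://example.com!"))
# ['quick', 'fox', 'see']
```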
Explain how you would use Python to clean a text dataset before analysis.
Imagine you have a messy text file and want to prepare it for counting word frequencies.
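A minimal sketch, assuming the messy text has already been read into a string (the sample input is hypothetical):

```python
import re
from collections import Counter

# Hypothetical messy input standing in for a file's contents.
raw = "Dogs bark. Dogs RUN!  dogs, cats run."

# Lowercase, then pull out alphabetic words: clean + tokenize in one pass.
words = re.findall(r"[a-z]+", raw.lower())
freq = Counter(words)
print(freq.most_common(2))  # [('dogs', 3), ('run', 2)]
```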