Experiment - Lemmatization
Problem: You want to clean text data by reducing words to their base form via lemmatization. The current pipeline uses simple tokenization only, so inflected forms of the same word (e.g., "run", "runs", "running") are treated as distinct tokens.
Current Metrics: Unique tokens before lemmatization: 1,200; text-classification accuracy with tokenization only: 75%.
Issue: Because many surface forms of the same word are counted separately, the vocabulary is inflated and noisy, which hurts the model's ability to generalize.
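The vocabulary-shrinking effect described above can be sketched with a minimal, self-contained example. The lookup table here is a toy stand-in for a real lemmatizer (in practice you would use something like NLTK's `WordNetLemmatizer` or spaCy); the token list and mappings are illustrative assumptions, not data from this experiment.

```python
# Toy illustration of how lemmatization shrinks the vocabulary.
# LEMMA_TABLE is a hypothetical stand-in for a real lemmatizer
# such as NLTK's WordNetLemmatizer or spaCy's pipeline.
LEMMA_TABLE = {
    "running": "run", "runs": "run", "ran": "run",
    "cats": "cat",
    "studies": "study", "studying": "study",
}

def lemmatize(token: str) -> str:
    """Map a token to its base form; fall back to the lowercased token."""
    return LEMMA_TABLE.get(token.lower(), token.lower())

tokens = "The cats were running while he studies and she was studying".split()
lemmas = [lemmatize(t) for t in tokens]

unique_raw = len(set(t.lower() for t in tokens))
unique_lemmas = len(set(lemmas))
print(unique_raw, unique_lemmas)  # the lemma vocabulary is smaller
```

Feeding lemmas instead of raw tokens into the classifier is what should reduce the 1,200-token vocabulary and the noise noted above.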