When using regular expressions for text cleaning, the key question is how much the cleaning improves data quality: how well it removes unwanted characters or patterns without losing important information. Precision and recall can be adapted to measure this:
- Precision: Of everything the pattern removed, the fraction that was actually unwanted (correct removals / all removals).
- Recall: Of everything that was unwanted, the fraction that was actually removed (correct removals / all unwanted parts).
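Good cleaning leaves useful text intact (high precision) and removes as much of the noise as possible (high recall), so downstream models train on cleaner data. The two goals pull against each other: a more aggressive pattern catches more noise but is more likely to delete useful text along with it.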
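
As a rough illustration, the two metrics can be computed at the character level by comparing what a pattern removes against a hand-labeled set of noise positions. This is a minimal sketch, not a standard API: the function name `removal_scores`, the sample string, the labeled `noise_spans`, and the choice to return 1.0 when a denominator is empty are all assumptions made for the example.

```python
import re

def removal_scores(text, pattern, noise_spans):
    """Character-level precision and recall for a regex-based removal.

    noise_spans is a hand-labeled list of (start, end) index pairs marking
    the characters that SHOULD be removed (end is exclusive).
    """
    # Every character index the pattern would delete.
    removed = {i for m in re.finditer(pattern, text)
               for i in range(m.start(), m.end())}
    # Every character index labeled as noise.
    noise = {i for start, end in noise_spans for i in range(start, end)}

    correct = len(removed & noise)
    # Convention assumed here: an empty denominator yields 1.0.
    precision = correct / len(removed) if removed else 1.0
    recall = correct / len(noise) if noise else 1.0
    return precision, recall

# Hypothetical sample: the HTML tags are the only noise.
raw = "Price: $5 <b>off</b>"
noise_spans = [(10, 13), (16, 20)]  # "<b>" and "</b>"

# A targeted pattern removes exactly the tags.
print(removal_scores(raw, r"<[^>]+>", noise_spans))      # (1.0, 1.0)

# An aggressive pattern strips every non-letter, non-space character:
# it deletes ":", "$" and "5" (lower precision) and leaves the "b"
# inside each tag behind (lower recall).
p, r = removal_scores(raw, r"[^A-Za-z ]", noise_spans)
print(f"precision={p:.3f} recall={r:.3f}")               # precision=0.625 recall=0.714
```

In practice the labeled spans would come from a small manually annotated sample of the real data, and the same comparison could be made per token or per line instead of per character.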