When working with text data that includes Unicode characters, a key metric to track is tokenization accuracy: how reliably the model or preprocessing pipeline splits text into meaningful units (tokens) without breaking or silently dropping Unicode characters. This matters most for languages outside the ASCII range, for accented letters, and for emoji, where a naive tokenizer can discard characters entirely and the model never sees them.
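As a minimal sketch of how character loss shows up, the example below compares a naive ASCII-only tokenizer against a Unicode-aware one, and measures what fraction of the non-space characters survive tokenization. The function names and the simple coverage ratio are illustrative, not a standard API.

```python
import re

def tokenize_ascii(text):
    # Naive tokenizer: matches only ASCII letters and digits,
    # so accented letters and emoji are silently dropped.
    return re.findall(r"[A-Za-z0-9]+", text)

def tokenize_unicode(text):
    # Unicode-aware tokenizer: in Python 3, \w matches word
    # characters in any script; the explicit range keeps most
    # emoji as their own tokens.
    return re.findall(r"\w+|[\U0001F300-\U0001FAFF]", text)

def coverage(text, tokens):
    # Illustrative "tokenization accuracy" proxy: fraction of
    # non-space characters that survive into the token stream.
    preserved = sum(len(t) for t in tokens)
    total = len(text.replace(" ", ""))
    return preserved / total if total else 1.0

text = "café naïve 😀"
print(tokenize_ascii(text))                       # ['caf', 'na', 've']
print(tokenize_unicode(text))                     # ['café', 'naïve', '😀']
print(coverage(text, tokenize_ascii(text)))       # 0.7
print(coverage(text, tokenize_unicode(text)))     # 1.0
```

The ASCII tokenizer loses three of the ten characters without raising any error, which is exactly why coverage has to be measured rather than assumed.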
Character-level error rate (CER) is also important. It measures what fraction of Unicode characters are misread or misrepresented during processing, typically as the edit distance between the reference text and the processed output, divided by the reference length. Even a single-character error, such as a dropped accent or a mis-decoded code point, can change the meaning of a word or sentence.
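A small sketch of this metric, using the standard Levenshtein edit distance over Unicode code points normalized by the reference length (the function name is illustrative):

```python
def char_error_rate(reference, hypothesis):
    # Levenshtein edit distance over Unicode code points,
    # normalized by the reference length (standard CER form).
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # distances for the previous row
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,       # deletion
                          curr[j - 1] + 1,   # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m if m else 0.0

# A dropped accent counts as one substitution in five characters:
print(char_error_rate("naïve", "naive"))  # 0.2
```

Note that this operates on code points; text with combining characters may warrant normalizing both strings (e.g. NFC) before comparing, so that "é" composed and decomposed forms don't register as spurious errors.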