When evaluating a model on words it never saw during training, the first metric to check is coverage: the fraction of tokens (or distinct word types) in the new data that appear in the model's vocabulary. Low coverage means many words map to an unknown-word token, which can degrade predictions.
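A minimal sketch of computing coverage, at both the token level and the type level. The names `vocab` and `new_tokens` are illustrative stand-ins for a model's vocabulary and a tokenized evaluation corpus:

```python
def coverage(vocab, tokens):
    """Fraction of the given tokens that appear in the vocabulary."""
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in vocab)
    return known / len(tokens)

# Toy vocabulary and tokenized new data (illustrative only).
vocab = {"the", "cat", "sat", "on", "mat"}
new_tokens = ["the", "dog", "sat", "on", "the", "mat"]

token_coverage = coverage(vocab, new_tokens)       # counts repeats
type_coverage = coverage(vocab, set(new_tokens))   # distinct words only
print(token_coverage, type_coverage)
```

Token-level coverage weights frequent words more heavily, while type-level coverage treats every distinct word equally; which one matters depends on whether rare words are the concern.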
Besides coverage, downstream metrics such as accuracy or F1 on tasks like text classification or named entity recognition indirectly show how well the model handles unknown words; comparing scores on examples that contain out-of-vocabulary words against scores on examples that do not helps isolate their effect.
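One way to make that indirect signal concrete is to slice task accuracy by whether an example contains an out-of-vocabulary word. The following sketch assumes toy `(tokens, gold, pred)` triples; the names `vocab` and `examples` are hypothetical:

```python
def accuracy_by_oov(vocab, examples):
    """Split (tokens, gold_label, predicted_label) examples by OOV
    presence and report accuracy for each slice (None if empty)."""
    slices = {"with_oov": [], "no_oov": []}
    for tokens, gold, pred in examples:
        key = "with_oov" if any(t not in vocab for t in tokens) else "no_oov"
        slices[key].append(gold == pred)
    return {k: (sum(v) / len(v) if v else None) for k, v in slices.items()}

# Toy binary-classification examples (illustrative only).
vocab = {"good", "bad", "movie", "plot"}
examples = [
    (["good", "movie"], 1, 1),
    (["bad", "plot"], 0, 0),
    (["zany", "movie"], 1, 0),  # "zany" is out of vocabulary
]
print(accuracy_by_oov(vocab, examples))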
We also assess the quality of the embeddings assigned to unknown words, typically measured by downstream task performance or by their similarity to the embeddings of related in-vocabulary words.
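As a sketch of the similarity-based check, one common backoff builds an unknown word's vector by averaging the vectors of its character n-grams (fastText-style) and then scores it by cosine similarity against known-word vectors. All vectors below are toy, hand-picked values for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ngrams(word, n=3):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def oov_vector(word, ngram_vecs, dim=2):
    """Average the vectors of the word's known n-grams; zero if none."""
    known = [ngram_vecs[g] for g in ngrams(word) if g in ngram_vecs]
    if not known:
        return [0.0] * dim
    return [sum(vals) / len(known) for vals in zip(*known)]

# Toy 2-d n-gram vectors (illustrative only).
ngram_vecs = {"<ca": [1.0, 0.0], "cat": [0.9, 0.1], "ats": [0.8, 0.2]}
vec = oov_vector("cats", ngram_vecs)
print(cosine(vec, [1.0, 0.0]))  # similarity to a known "cat"-like vector
```

A high similarity to semantically related known words suggests the backoff produces usable vectors; a score near zero suggests the unknown word is effectively unrepresented.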