Overview - Tokenization and vocabulary
What is it?
Tokenization is the process of breaking text into smaller pieces called tokens, which can be words, parts of words, or characters. For example, the sentence "I love cats" might be split into the tokens "I", "love", and "cats". Vocabulary is the collection of all unique tokens that a model knows and uses to understand and generate language. Together, tokenization and vocabulary let machines read and work with human language by turning sentences into manageable pieces.
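The idea above can be sketched in a few lines of Python. This is a simplified word-level tokenizer, not how production models do it (they typically use subword methods such as BPE); the function names and sample sentences here are illustrative, not from any real library.

```python
# A minimal sketch of word-level tokenization and vocabulary building.
# Real models usually use subword tokenization (e.g. BPE) instead.

def tokenize(text):
    """Split text into lowercase word tokens."""
    return text.lower().split()

sentences = [
    "The cat sat on the mat",
    "The dog sat on the log",
]

# The vocabulary is the set of all unique tokens seen in the data.
vocabulary = sorted({token for s in sentences for token in tokenize(s)})

print(tokenize("The cat sat"))  # ['the', 'cat', 'sat']
print(vocabulary)               # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
```

Splitting on whitespace keeps the example simple; real tokenizers also handle punctuation, casing, and words outside the vocabulary.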
Why it matters
Without tokenization and vocabulary, machines would see text as long, confusing strings of letters with no clear meaning. This would make it impossible for AI to understand, learn from, or generate language effectively. Tokenization and vocabulary let AI models handle language in a structured way, enabling everything from chatbots to translation tools to work well.
Where it fits
Before learning tokenization and vocabulary, you should understand basic text data and how computers represent information. After this, you can learn how tokens are mapped to numbers and embedded as vectors, and how models use these embeddings to learn language patterns.
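The next step mentioned above, turning tokens into numbers, can be previewed with a short sketch. The vocabulary and token names here are made up for illustration; in a real model, these integer IDs are what index into an embedding table.

```python
# A minimal sketch of mapping tokens to integer IDs, the step that
# precedes embedding lookup in a language model. The vocabulary here
# is a toy example, not from any real tokenizer.

vocabulary = ["cat", "dog", "on", "sat", "the"]

# Assign each token a unique integer ID based on its position.
token_to_id = {token: i for i, token in enumerate(vocabulary)}

def encode(tokens):
    """Convert a list of tokens into their integer IDs."""
    return [token_to_id[token] for token in tokens]

print(encode(["the", "cat", "sat"]))  # [4, 0, 3]
```

Once text is reduced to lists of IDs like this, a model can look up a numeric vector (embedding) for each ID and start learning patterns over those vectors.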