Token
The basic unit of text a language model processes, roughly corresponding to a word or word fragment.
Full Definition
Tokens are the discrete units into which text is split before being fed to a language model. A tokeniser, typically using Byte Pair Encoding (BPE), maps text to integer IDs from a fixed vocabulary. Common English words map to single tokens; rarer words split into subword pieces (e.g. 'unhappiness' → 'un' + 'happiness'); for very rare strings, each character or byte may become its own token. As a rule of thumb, 1 token ≈ 4 characters ≈ 0.75 words in English, though the ratio varies by language and content. Tokens are the unit of billing for API usage, the constraint on context-window size, and the fundamental currency of LLM computation, so understanding tokenisation is essential for predicting cost, fitting content within context limits, and avoiding token-boundary effects.
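The BPE idea above can be sketched in a few lines: start from characters, repeatedly find the most frequent adjacent pair of symbols across the corpus, and fuse it into a new vocabulary entry. This is a toy sketch with an invented three-word corpus; real tokenisers operate on bytes, learn tens of thousands of merges, and handle ties and pre-tokenisation far more carefully.

```python
from collections import Counter

def most_frequent_pair(words):
    # words: {tuple_of_symbols: corpus_frequency}
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    # Rewrite every word, fusing adjacent occurrences of `pair` into one symbol.
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Invented toy corpus: each word starts as a tuple of characters.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    merges.append("".join(pair))
    corpus = merge_pair(corpus, pair)

print(merges)  # → ['lo', 'low', 'lowe']
```

Each learned merge becomes a vocabulary entry, which is why frequent words end up as single tokens while rare words decompose into the subword pieces learned along the way.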
Examples
'Hello, world!' tokenises to ['Hello', ',', ' world', '!'] (4 tokens) in GPT-4's tokeniser (tiktoken, cl100k_base); note that ' world' carries its leading space.
A 1,500-word essay contains roughly 2,000 tokens; at an illustrative rate of $1 per 1M input tokens, that is about $0.002 in input cost.
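The rules of thumb in these examples (≈4 characters per token, linear per-token pricing) can be turned into a quick budgeting helper. This is a rough estimator only, not a tokeniser; for exact counts use a real tokeniser such as tiktoken, and note that the function names and the $1/1M rate here are illustrative.

```python
import math

def estimate_tokens(text: str) -> int:
    # Rule-of-thumb estimate: ~1 token per 4 characters of English text.
    # Real tokenisers give exact counts; use this only for rough budgeting.
    return max(1, math.ceil(len(text) / 4))

def estimate_cost_usd(n_tokens: int, usd_per_million: float) -> float:
    # Linear pricing: cost scales with the number of input tokens.
    return n_tokens / 1_000_000 * usd_per_million

essay_tokens = 2_000  # roughly a 1,500-word English essay
print(estimate_cost_usd(essay_tokens, 1.0))  # → 0.002
```

Handy for sanity-checking whether a document will fit a context window or blow a cost budget before sending it to an API.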
Apply this in your prompts
PromptITIN automatically applies concepts like tokenisation to build better prompts for you.