Token
The basic unit of text a language model processes, roughly corresponding to a word or word fragment.
Full Definition
Tokens are the discrete units into which text is split before being fed to a language model. A tokeniser, typically using Byte Pair Encoding (BPE), maps text to integer IDs from a fixed vocabulary. Common English words map to single tokens; rarer words split into subword pieces (e.g. 'unhappiness' → 'un' + 'happiness'); for very rare strings, each character or byte may become its own token. As a rule of thumb, 1 token ≈ 4 characters ≈ 0.75 words in English, though the ratio varies by language and content. Tokens are the unit of billing for API usage, the constraint on context-window size, and the fundamental currency of LLM computation, so understanding tokenisation is essential for predicting cost, fitting content within context limits, and avoiding token-boundary effects.
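The BPE idea above can be sketched in a few lines: start from characters, repeatedly find the most frequent adjacent pair of symbols across the corpus, and fuse it into a new vocabulary entry. This is a toy sketch with an invented three-word corpus; real tokenisers operate on bytes, learn tens of thousands of merges, and handle ties and pre-tokenisation far more carefully.

```python
from collections import Counter

def most_frequent_pair(words):
    # words: {tuple_of_symbols: corpus_frequency}
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    # Rewrite every word, fusing adjacent occurrences of `pair` into one symbol.
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Invented toy corpus: each word starts as a tuple of characters.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    merges.append("".join(pair))
    corpus = merge_pair(corpus, pair)

print(merges)  # → ['lo', 'low', 'lowe']
```

Each learned merge becomes a vocabulary entry, which is why frequent words end up as single tokens while rare words decompose into the subword pieces learned along the way.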
Examples
'Hello, world!' tokenises to ['Hello', ',', ' world', '!'] (4 tokens) in GPT-4's tokeniser (tiktoken, cl100k_base); note that ' world' carries its leading space.
A 1,500-word essay contains roughly 2,000 tokens; at an illustrative rate of $1 per 1M input tokens, that is about $0.002 in input cost.
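The rules of thumb in these examples (≈4 characters per token, linear per-token pricing) can be turned into a quick budgeting helper. This is a rough estimator only, not a tokeniser; for exact counts use a real tokeniser such as tiktoken, and note that the function names and the $1/1M rate here are illustrative.

```python
import math

def estimate_tokens(text: str) -> int:
    # Rule-of-thumb estimate: ~1 token per 4 characters of English text.
    # Real tokenisers give exact counts; use this only for rough budgeting.
    return max(1, math.ceil(len(text) / 4))

def estimate_cost_usd(n_tokens: int, usd_per_million: float) -> float:
    # Linear pricing: cost scales with the number of input tokens.
    return n_tokens / 1_000_000 * usd_per_million

essay_tokens = 2_000  # roughly a 1,500-word English essay
print(estimate_cost_usd(essay_tokens, 1.0))  # → 0.002
```

Handy for sanity-checking whether a document will fit a context window or blow a cost budget before sending it to an API.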
Apply this in your prompts
PromptITIN automatically applies concepts like tokenisation to build better prompts for you.