What Tokens Are and Why They Matter
Tokens are the fundamental units that language models process — not exactly words, but word fragments. A common word like 'cat' is typically one token. A longer word like 'understanding' might be two tokens (under + standing). A complex technical term or a word from a less common language may be split into several tokens. Punctuation, spaces, and special characters count as tokens too. As a rough rule of thumb: 100 tokens ≈ 75 words in English. Tokens matter because AI models are priced per token (input + output), and each model has a maximum number of tokens it can process in a single request. Understanding tokens helps you estimate costs and plan efficiently.
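That rule of thumb is enough for planning purposes. A minimal sketch of a heuristic estimator, assuming the 100-tokens-per-75-words ratio above (exact counts require the model's own tokenizer, e.g. a library such as tiktoken for OpenAI models):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate via the ~100 tokens per 75 words heuristic.

    This is a planning-level approximation only; real tokenizers give
    exact, model-specific counts.
    """
    words = len(text.split())
    return round(words / 0.75)

# 9 words -> roughly 12 tokens
print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 12
```

The heuristic is least accurate for code, non-English text, and technical jargon, where per-word token counts run higher.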
What a Context Window Is
The context window is the total number of tokens an AI model can process in a single session — the sum of your input (prompt + any pasted text) and its output (the response). Think of it as the model's working memory: everything within the context window is what the model can 'see' and reason about. Anything outside the window is inaccessible. If you have a 128,000-token context window and paste in a 100-page document (roughly 75,000 words), you've used about 100,000 tokens on the input alone, leaving 28,000 tokens for the model's response. Context window size is one of the most practically important differences between models.
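The budget arithmetic above is simple enough to sketch directly, using the rough word-to-token heuristic from earlier rather than an exact tokenizer count:

```python
def remaining_output_budget(window_tokens: int, input_tokens: int) -> int:
    """Tokens left for the model's response once the input is counted."""
    return max(window_tokens - input_tokens, 0)

# A 100-page document: ~75,000 words ≈ 100,000 tokens of input
input_tokens = round(75_000 / 0.75)
print(remaining_output_budget(128_000, input_tokens))  # 28000
```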
Context Window Sizes Across Major Models
Context window sizes have grown dramatically. GPT-3.5 had a 4,096-token window — roughly 3,000 words, or a few pages of text. GPT-4 Turbo has 128,000 tokens — enough for a full novel. Claude models support up to 200,000 tokens — enough for very large codebases or long research papers. Gemini 1.5 Pro supports up to 1 million tokens. These numbers change rapidly as models improve. The practical implication: tasks that required complex chunking and multiple queries a year ago can now be done in a single prompt. For most everyday tasks, any modern model's context window is large enough; the limits matter mainly for long-document processing.
What Happens When You Exceed the Context Window
When your total tokens (input + output) approach or exceed the context window, earlier content is either truncated (cut off) or the model produces lower-quality responses as it struggles to maintain coherence across too much context. In a multi-turn conversation, this means the model 'forgets' the earliest messages as the context fills up. In a document analysis task, material in the middle of a long document often receives less attention than the beginning and end. The practical fix: for very long documents, break them into focused sections and query each separately. For long conversations, start fresh when output quality degrades.
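The "break it into focused sections" fix can be sketched as a simple chunker. This version splits on word boundaries and sizes chunks with the rough word-to-token heuristic from earlier; a production pipeline would count tokens with the model's actual tokenizer and usually split on section or paragraph boundaries instead:

```python
def chunk_text(text: str, max_tokens: int,
               tokens_per_word: float = 4 / 3) -> list[str]:
    """Split text into word-boundary chunks that each fit a token budget.

    tokens_per_word defaults to the ~100 tokens per 75 words heuristic.
    """
    words_per_chunk = int(max_tokens / tokens_per_word)
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

# Each chunk then becomes one focused query against the model
for chunk in chunk_text("a b c d e f g", max_tokens=5):
    print(chunk)
```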
Tokens, Pricing, and Cost Estimation
Most commercial AI APIs charge per token — separately for input tokens and output tokens, with output usually priced higher. Typical rates range from fractions of a cent per 1,000 tokens for efficient models to several cents per 1,000 tokens for the most capable frontier models. For casual use, this rarely matters. For applications processing thousands of requests, token efficiency becomes economically significant. A prompt with a 500-token system prompt, 200-token user input, and 400-token response uses 1,100 tokens total. At $0.01 per 1,000 tokens, that's $0.011 per request — manageable individually, but $110 per 10,000 requests. Optimizing prompt length matters at scale.
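The per-request arithmetic above generalizes to a small cost function. This sketch keeps input and output prices separate, since output is usually priced higher; the flat $0.01 per 1,000 tokens below matches the illustrative rate in the example, not any particular provider's pricing:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float,
                 output_price_per_1k: float) -> float:
    """Dollar cost of one request; prices are per 1,000 tokens."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# The example above: 700 input tokens (system + user), 400 output tokens
cost = request_cost(700, 400, input_price_per_1k=0.01, output_price_per_1k=0.01)
print(f"${cost:.4f} per request, ${cost * 10_000:.2f} per 10,000 requests")
# $0.0110 per request, $110.00 per 10,000 requests
```

With asymmetric pricing, a long system prompt repeated on every request dominates input cost, which is why trimming it pays off at scale.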