Inference
The process of running a trained model to generate predictions or responses from new inputs.
Full Definition
Inference is the forward-pass computation that produces a model's output for a given input, using fixed (not updated) weights. It is distinct from training, where weights are actively updated via backpropagation. Inference cost is measured in compute (FLOPs per token), latency (time to first token and time per output token), and memory (VRAM to hold the model weights plus the KV cache). Inference optimisation is a large engineering discipline: quantisation, speculative decoding, batching, caching, and hardware-specific kernels (e.g. FlashAttention) all reduce cost and latency. Because a deployed model may serve millions of users simultaneously, inference efficiency is as important as model quality in production.
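The memory term above can be made concrete with a back-of-the-envelope estimate. The sketch below is illustrative only: the architecture numbers in the usage example loosely resemble a 70B-class model but are assumptions, and the formula ignores activations, framework overhead, and paged or quantised KV-cache schemes.

```python
def inference_memory_gib(n_params, n_layers, n_kv_heads, head_dim,
                         seq_len, batch_size, bytes_per_el=2):
    """Rough VRAM estimate for inference: weights + KV cache, in GiB.

    bytes_per_el=2 assumes fp16/bf16 storage for both weights and cache.
    """
    weight_bytes = n_params * bytes_per_el
    # Per layer, the cache holds a K and a V tensor of shape
    # (batch, seq_len, n_kv_heads, head_dim) -- hence the factor of 2.
    kv_bytes = (2 * n_layers * batch_size * seq_len
                * n_kv_heads * head_dim * bytes_per_el)
    return (weight_bytes + kv_bytes) / 1024**3

# Hypothetical 70B model in fp16, 8 KV heads (grouped-query attention),
# one 8192-token sequence: weights dominate, the cache adds a few GiB.
print(inference_memory_gib(70_000_000_000, 80, 8, 128, 8192, 1))
```

Note how batching multiplies only the KV-cache term, which is why serving many concurrent requests is cheaper per request than the weight footprint alone would suggest.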
Examples
An API endpoint receiving a user's chat message, running a forward pass through a 70B parameter model, and returning a streamed response in under 2 seconds.
Using speculative decoding to reduce Llama 3 70B inference latency by having a smaller draft model propose token sequences for the large model to verify.
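The draft-and-verify loop can be sketched with toy models. This is a minimal greedy variant, not the rejection-sampling scheme used in production systems: `target` and `draft` are stand-in functions mapping a token sequence to the next token, and the target's "single verification pass" is simulated with a Python loop.

```python
def speculative_decode(target, draft, prompt, k=4, max_new=16):
    """Toy greedy speculative decoding.

    `target` and `draft` each map a token sequence to the next (greedy)
    token. The cheap draft proposes k tokens; the expensive target then
    checks all k positions (one batched forward pass in a real system)
    and keeps the longest agreeing prefix, plus its own correction.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposed = []
        for _ in range(k):
            proposed.append(draft(seq + proposed))
        # 2. Target verifies each proposed position; on the first
        #    mismatch, its own token replaces the draft's and we stop.
        accepted = []
        for i in range(k):
            t = target(seq + proposed[:i])
            accepted.append(t)
            if t != proposed[i]:
                break
        seq += accepted
    return seq[len(prompt) :][:max_new]
```

Greedy speculative decoding is lossless: the output is identical to decoding with the target alone, and the speedup comes from the target verifying several draft tokens per forward pass instead of emitting one token at a time.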
Apply this in your prompts
PromptITIN automatically uses techniques like Inference to build better prompts for you.