Inference
The process of running a trained model to generate predictions or responses from new inputs.
Full Definition
Inference is the forward-pass computation that produces a model's output for a given input, using fixed (not updated) weights. It is distinct from training, where weights are actively updated via backpropagation. Inference cost is measured in compute (FLOPs per token), latency (time to first token and time per output token), and memory (VRAM to hold the model weights plus the KV cache). Inference optimisation is a large engineering discipline: quantisation, speculative decoding, batching, caching, and hardware-specific kernels (e.g. FlashAttention) all reduce cost and latency. Because a deployed model may serve millions of users simultaneously, inference efficiency is as important as model quality in production.
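The memory term above can be made concrete with a back-of-the-envelope estimate. The sketch below is illustrative only: the architecture numbers in the usage example loosely resemble a 70B-class model but are assumptions, and the formula ignores activations, framework overhead, and paged or quantised KV-cache schemes.

```python
def inference_memory_gib(n_params, n_layers, n_kv_heads, head_dim,
                         seq_len, batch_size, bytes_per_el=2):
    """Rough VRAM estimate for inference: weights + KV cache, in GiB.

    bytes_per_el=2 assumes fp16/bf16 storage for both weights and cache.
    """
    weight_bytes = n_params * bytes_per_el
    # Per layer, the cache holds a K and a V tensor of shape
    # (batch, seq_len, n_kv_heads, head_dim) -- hence the factor of 2.
    kv_bytes = (2 * n_layers * batch_size * seq_len
                * n_kv_heads * head_dim * bytes_per_el)
    return (weight_bytes + kv_bytes) / 1024**3

# Hypothetical 70B model in fp16, 8 KV heads (grouped-query attention),
# one 8192-token sequence: weights dominate, the cache adds a few GiB.
print(inference_memory_gib(70_000_000_000, 80, 8, 128, 8192, 1))
```

Note how batching multiplies only the KV-cache term, which is why serving many concurrent requests is cheaper per request than the weight footprint alone would suggest.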
Examples
An API endpoint receiving a user's chat message, running a forward pass through a 70B parameter model, and returning a streamed response in under 2 seconds.
Using speculative decoding to reduce Llama 3 70B inference latency by having a smaller draft model propose token sequences for the large model to verify.
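The draft-and-verify loop can be sketched with toy models. This is a minimal greedy variant, not the rejection-sampling scheme used in production systems: `target` and `draft` are stand-in functions mapping a token sequence to the next token, and the target's "single verification pass" is simulated with a Python loop.

```python
def speculative_decode(target, draft, prompt, k=4, max_new=16):
    """Toy greedy speculative decoding.

    `target` and `draft` each map a token sequence to the next (greedy)
    token. The cheap draft proposes k tokens; the expensive target then
    checks all k positions (one batched forward pass in a real system)
    and keeps the longest agreeing prefix, plus its own correction.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposed = []
        for _ in range(k):
            proposed.append(draft(seq + proposed))
        # 2. Target verifies each proposed position; on the first
        #    mismatch, its own token replaces the draft's and we stop.
        accepted = []
        for i in range(k):
            t = target(seq + proposed[:i])
            accepted.append(t)
            if t != proposed[i]:
                break
        seq += accepted
    return seq[len(prompt) :][:max_new]
```

Greedy speculative decoding is lossless: the output is identical to decoding with the target alone, and the speedup comes from the target verifying several draft tokens per forward pass instead of emitting one token at a time.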
Apply this in your prompts
PromptITIN automatically uses techniques like Inference to build better prompts for you.