
Streaming

Sending model output tokens to the client incrementally as they are generated rather than all at once.

Full Definition

Streaming delivers model outputs token-by-token (or in small chunks) over a persistent HTTP connection, typically using Server-Sent Events (SSE) or WebSockets, rather than sending the entire response only after generation completes. This dramatically improves perceived latency: users see the first words in under a second instead of waiting tens of seconds for a long response to finish. Streaming is now the default delivery mode for most major LLM APIs and consumer chat products. Implementing streaming on the client side requires handling partial JSON, managing UI state for incremental text rendering, and gracefully recovering from interrupted streams.
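To make the SSE framing concrete, here is a minimal sketch of parsing a streamed buffer into JSON chunks. The `data:`-line framing and `[DONE]` sentinel follow the common OpenAI-style convention; the `delta` field name is an illustrative assumption, not any specific API's schema.

```python
import json

def parse_sse(raw: str):
    """Parse a buffer of Server-Sent Events into JSON payloads.

    Assumes OpenAI-style `data: {...}` lines with a `data: [DONE]`
    sentinel (a common convention, not part of the SSE spec itself).
    """
    events = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # server signalled end of stream
        events.append(json.loads(payload))
    return events

# Example: two chunks, each carrying a token delta (hypothetical schema).
raw = (
    'data: {"delta": "Hello"}\n\n'
    'data: {"delta": ", world"}\n\n'
    "data: [DONE]\n\n"
)
chunks = parse_sse(raw)
text = "".join(c["delta"] for c in chunks)
print(text)  # Hello, world
```

In a real client the buffer arrives incrementally over the open connection and the UI is updated as each chunk is parsed, which is what produces the word-by-word effect described above.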

Examples

1. ChatGPT's typing-cursor effect, where words appear in real time, is implemented via streaming from OpenAI's API.

2. A code generation tool showing the generated function appearing line by line while the model is still generating later lines.
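The second example depends on handling partial JSON, since network chunk boundaries need not align with JSON object boundaries. A minimal buffering sketch follows; the `ChunkAssembler` name, the newline-delimited framing, and the `delta` field are all illustrative assumptions:

```python
import json

class ChunkAssembler:
    """Buffer raw network chunks that may split JSON objects mid-string,
    emitting complete objects as they become parseable.

    A hypothetical helper for illustration; real clients often rely on
    an SDK that performs this buffering internally.
    """

    def __init__(self):
        self._buf = ""

    def feed(self, chunk: str):
        """Append a chunk; return any newline-delimited JSON objects
        completed by it, keeping the unfinished remainder buffered."""
        self._buf += chunk
        objects = []
        while "\n" in self._buf:
            line, self._buf = self._buf.split("\n", 1)
            if line.strip():
                objects.append(json.loads(line))
        return objects

assembler = ChunkAssembler()
out = []
# The second JSON object arrives split across two network reads.
for chunk in ['{"delta": "def add(a, b):"}\n{"del',
              'ta": "\\n    return a + b"}\n']:
    out.extend(assembler.feed(chunk))
code = "".join(o["delta"] for o in out)
```

Accumulating the deltas in order reconstructs the full function, which is how a UI can render completed lines while later ones are still being generated.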


Related Terms

Inference

The process of running a trained model to generate predictions or responses from input data.

Latency

The time delay between sending a request to a model and receiving its first token.

API

An Application Programming Interface that lets developers call AI model capabilities from their own software.