Transformer
The neural network architecture, based on self-attention, that underpins virtually all modern large language models.
Full Definition
The transformer architecture, introduced in the 2017 Google paper 'Attention Is All You Need', replaced recurrent networks as the dominant sequence modelling approach. Its core innovation is the self-attention mechanism, which allows every token in a sequence to directly attend to every other token in a single operation, capturing long-range dependencies that RNNs struggled with. The architecture consists of stacked encoder and decoder layers, each containing multi-head self-attention and feed-forward sub-layers with residual connections and layer normalisation. GPT models use decoder-only transformers; BERT uses encoder-only; T5 uses encoder-decoder. Virtually every frontier LLM today is a scaled-up transformer.
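To make the mechanism concrete, below is a minimal sketch of single-head scaled dot-product self-attention in Python with NumPy. The function name, projection matrices, and toy dimensions are illustrative assumptions, not the internals of any particular model, and a real transformer layer would add multiple heads, residual connections, and layer normalisation around this core.

import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Minimal single-head self-attention sketch (illustrative, not production code).

    x:             (seq_len, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q = x @ w_q   # queries
    k = x @ w_k   # keys
    v = x @ w_v   # values
    d_head = q.shape[-1]
    # One matrix multiply lets every token score every other token directly.
    scores = q @ k.T / np.sqrt(d_head)                          # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ v                                          # (seq_len, d_head) context vectors

# Toy usage: 4 tokens, model width 8, head width 4 (hypothetical sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (4, 4)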
Examples
GPT-4's decoder-only transformer predicting the next token by attending over all previous tokens in the context window simultaneously.
BERT using bidirectional encoder-only transformers to understand context from both sides of a masked word during pre-training.
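The two examples above differ mainly in their attention mask. The sketch below, again in NumPy with hypothetical names, contrasts the causal mask a decoder-only model such as GPT applies, so each position attends only to earlier positions, with the full bidirectional mask an encoder-only model such as BERT uses.

import numpy as np

seq_len = 5

# Decoder-only (GPT-style): lower-triangular causal mask.
# Position i may attend to positions 0..i, never to future tokens.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Encoder-only (BERT-style): every position attends to every position,
# so the mask is all ones (bidirectional context).
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

def apply_mask(scores, mask):
    # Disallowed positions get -inf before the softmax, so their attention weight is zero.
    return np.where(mask, scores, -np.inf)

scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
print(apply_mask(scores, causal_mask))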
Related Terms
Attention Mechanism: The core transformer operation that weighs the relevance of each token to every other token in the sequence.
Self-Attention: An attention operation where a sequence attends to itself, allowing each token to attend to every other token in the same sequence.
Large Language Model: A neural network with billions of parameters trained on text to understand and generate language.