
Self-Attention

An attention operation where a sequence attends to itself, allowing each token to gather context from all others.

Full Definition

Self-attention is a specific application of the attention mechanism in which all three inputs (queries, keys, and values) are derived from the same sequence. Each token produces a query vector (what it is looking for), a key vector (what it offers), and a value vector (what it contributes). The attention score between two tokens is the dot product of one token's query with the other's key, scaled by the square root of the key dimension, then passed through a softmax so the weights for each token sum to one. The output for each token is the weighted sum of all value vectors. Because every token can directly influence every other token in a single operation, self-attention makes transformers exceptionally good at capturing long-range dependencies in text.
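The definition above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration, not a production implementation: the projection matrices are random stand-ins for learned weights, and the dimensions (4 tokens, model width 8, head width 4) are arbitrary choices for the example.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.

    X: (seq_len, d_model) token embeddings.
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices (random here).
    """
    Q = X @ Wq  # queries: what each token is looking for
    K = X @ Wk  # keys: what each token offers
    V = X @ Wv  # values: what each token contributes

    d_k = K.shape[-1]
    # Pairwise query-key dot products, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len)

    # Row-wise softmax turns scores into weights that sum to 1 per token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each output row is a weighted sum of all value vectors
    return weights @ V

# Toy example: 4 tokens, model dim 8, head dim 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 4): one context-mixed vector per token
```

Note that the single matrix product `Q @ K.T` is what lets every token attend to every other token at once, which is the property the definition highlights.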

Examples

1. In the sentence 'She took the trophy because she had earned it', self-attention links the second 'she' back to 'She' at the start of the sentence, across a long span.

2. A coding model uses self-attention to link a function call to its definition several hundred tokens earlier in a long source file.


Related Terms

Attention Mechanism

The core transformer operation that weighs the relevance of each token to every other token.


Transformer

The neural network architecture that underpins all modern large language models.


Positional Encoding

A mechanism that injects token position information into transformer inputs.
