Attention Mechanism
The core transformer operation that weighs the relevance of each token to every other token in a sequence.
Full Definition
The attention mechanism computes, for each token in a sequence, a weighted sum over all other tokens — where the weights (attention scores) represent how relevant each other token is to the current one. Scores are computed as the dot product of query and key vectors, scaled and normalised via softmax. The result is a context-aware representation of each token that incorporates information from the entire sequence simultaneously.

Multi-head attention runs this process in parallel across multiple representation subspaces, capturing different types of relationships (syntactic, semantic, coreference) at once. Attention replaced recurrence and convolution as the primary sequence-processing operation because it parallelises better and captures long-range dependencies more effectively.
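The computation above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the function names, toy shapes, and omission of masking and learned projection matrices are simplifications for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) query, key, and value matrices.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise relevance of every token to every other
    weights = softmax(scores, axis=-1)  # each row sums to 1: a distribution over the sequence
    return weights @ V, weights         # context vectors and the attention map
```

In a full transformer, Q, K, and V are produced by learned linear projections of the input embeddings, and multi-head attention simply runs this function independently on several lower-dimensional projections before concatenating the results.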
Examples
When processing 'The animal didn't cross the street because it was too tired', attention correctly links 'it' to 'animal' rather than 'street'.
A cross-attention layer in a translation model attending to relevant source-language words while generating each target-language word.
Related Terms
Self-Attention — An attention operation where a sequence attends to itself, allowing each token t…
Transformer — The neural network architecture that underpins all modern large language models,…
Positional Encoding — A mechanism that injects token position information into transformer inputs, sin…