Attention Mechanism
The core transformer operation that weighs the relevance of each token to every other token in a sequence.
Full Definition
The attention mechanism computes, for each token in a sequence, a weighted sum over all other tokens — where the weights (attention scores) represent how relevant each other token is to the current one. Scores are computed as the dot product of query and key vectors, scaled and normalised via softmax. The result is a context-aware representation of each token that incorporates information from the entire sequence simultaneously.

Multi-head attention runs this process in parallel across multiple representation subspaces, capturing different types of relationships (syntactic, semantic, coreference) at once. Attention replaced recurrence and convolution as the primary sequence-processing operation because it parallelises better and captures long-range dependencies more effectively.
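The computation above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the function names, toy shapes, and omission of masking and learned projection matrices are simplifications for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) query, key, and value matrices.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise relevance of every token to every other
    weights = softmax(scores, axis=-1)  # each row sums to 1: a distribution over the sequence
    return weights @ V, weights         # context vectors and the attention map
```

In a full transformer, Q, K, and V are produced by learned linear projections of the input embeddings, and multi-head attention simply runs this function independently on several lower-dimensional projections before concatenating the results.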
Examples
When processing 'The animal didn't cross the street because it was too tired', attention correctly links 'it' to 'animal' rather than 'street'.
A cross-attention layer in a translation model attending to relevant source-language words while generating each target-language word.
Related Terms
Self-Attention — An attention operation where a sequence attends to itself, allowing each token t…
Transformer — The neural network architecture that underpins all modern large language models,…
Positional Encoding — A mechanism that injects token position information into transformer inputs, sin…