
Positional Encoding

A mechanism that injects token position information into transformer inputs, since attention is order-agnostic.

Full Definition

Transformers process all tokens in parallel and have no built-in notion of sequence order — without positional encoding, 'dog bites man' and 'man bites dog' would produce identical attention patterns. Positional encodings add position-specific signals to token embeddings. The original transformer used fixed sinusoidal functions; modern models use learned absolute positions or relative positional encodings. RoPE (Rotary Position Embedding), used in Llama and GPT-NeoX, applies a rotation to query and key vectors that encodes relative distance, scales elegantly to long contexts, and is critical for extending context windows via techniques like YaRN and LongRoPE.
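The rotation idea behind RoPE can be sketched in a few lines of NumPy (a simplified layout: real implementations such as Llama's split dimensions differently and operate on batched query/key tensors, and the function name here is illustrative):

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Rotate consecutive (even, odd) dimension pairs of x by
    position-dependent angles -- a minimal RoPE sketch."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)      # one frequency per pair
    angles = positions[:, None] * freqs[None, :]   # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# The key property: dot products of rotated queries and keys depend
# only on relative distance, not absolute position.
rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)
rq = lambda m: rope_rotate(q[None, :], np.array([m]))[0]
rk = lambda n: rope_rotate(k[None, :], np.array([n]))[0]
# position pairs (5, 3) and (9, 7) share offset 2 -> same attention score
same_score = np.allclose(rq(5) @ rk(3), rq(9) @ rk(7))
```

Because each 2D pair is rotated by an angle proportional to position, the query-key dot product depends only on the difference between positions, which is why RoPE behaves as a relative encoding.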

Examples

1. The original 'Attention Is All You Need' sinusoidal positional encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
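This formula can be implemented directly in NumPy (the function name is illustrative):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need':
    even dimensions get sin(pos / 10000^(2i/d_model)), odd get cos."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices: sine
    pe[:, 1::2] = np.cos(angles)   # odd indices: cosine
    return pe

pe = sinusoidal_pe(128, 64)   # one d_model-dim vector per position
```

The geometric spread of frequencies means low dimensions oscillate quickly (fine-grained position) while high dimensions oscillate slowly (coarse position), and the encoding needs no learned parameters.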

2. RoPE allowing Llama 3 to be extended from an 8k training context to 128k at inference by adjusting the rotation frequency scaling factor.
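Frequency scaling of this kind can be sketched as simple linear position interpolation (a simplification: YaRN and LongRoPE rescale frequencies non-uniformly, and the scale factor and function name here are illustrative):

```python
import numpy as np

def rope_angles(positions, d, base=10000.0, scale=1.0):
    """RoPE rotation angles with linear position interpolation:
    dividing positions by `scale` compresses a long sequence back into
    the angle range the model saw during training."""
    freqs = base ** (-np.arange(0, d, 2) / d)
    return (positions[:, None] / scale) * freqs[None, :]

# With scale=16, position 128_000 yields the same angles the model
# produced at position 8_000 during training.
extended = rope_angles(np.array([128_000]), 64, scale=16.0)
trained = rope_angles(np.array([8_000]), 64)
matches = np.allclose(extended, trained)
```

In practice the scaled model is usually fine-tuned briefly on long sequences, since compressing positions also compresses the resolution between nearby tokens.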


Related Terms

Attention Mechanism

The core transformer operation that weighs the relevance of each token to every other token in the sequence.


Transformer

The neural network architecture that underpins all modern large language models.


Context Length

The maximum number of tokens a model can handle in a single forward pass.
