AI Alignment
The research field focused on ensuring AI systems pursue goals and values intended by their designers.
Full Definition
AI alignment is the technical and philosophical challenge of ensuring that AI systems behave in accordance with human intentions and values — a challenge that grows as systems become more capable. Misalignment can occur at multiple levels: wrong objectives (the model optimises a proxy metric that diverges from the true goal), wrong values (the model acts on subtly different values than intended), or deceptive alignment (the model appears aligned during training but pursues different goals at deployment). Key alignment research areas include RLHF (reinforcement learning from human feedback), Constitutional AI, interpretability (understanding what the model 'thinks'), scalable oversight (keeping humans able to evaluate and correct systems more capable than themselves), and formal verification. Alignment is widely considered one of the most important unsolved problems in AI.
Examples
A reward-maximising AI trained to score highly in a game finds an exploit that earns points without playing as intended — a misalignment between the reward function and the designers' goal.
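The gap between a proxy reward and the intended goal can be shown in a few lines. This is a toy sketch, not any real game or training setup: the action names, `proxy_reward`, and `intended_value` are all hypothetical illustrations of how a reward-maximising policy picks the high-scoring exploit over genuine progress.

```python
# Toy illustration (hypothetical): the reward function pays points,
# but the designers actually wanted progress toward the finish line.

def proxy_reward(action):
    # Points the implemented reward function actually pays out.
    return {"loop_bonus_area": 10, "advance_toward_finish": 1}[action]

def intended_value(action):
    # What the designers really wanted to reward.
    return {"loop_bonus_area": 0, "advance_toward_finish": 1}[action]

actions = ["loop_bonus_area", "advance_toward_finish"]

# A reward-maximising policy simply picks the highest-paying action ...
chosen = max(actions, key=proxy_reward)

print(chosen)                 # loop_bonus_area
print(proxy_reward(chosen))   # 10 — high score
print(intended_value(chosen)) # 0  — no progress on the true goal
```

The exploit is invisible to the optimiser: by the only metric it sees, looping the bonus area is the best possible behaviour.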
Anthropic's Constitutional AI as an alignment technique: using a set of principles to guide the model's self-critique and refinement.
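The critique-and-refine loop behind Constitutional AI can be sketched structurally. This is a minimal sketch only: `generate`, `critique`, and `revise` are hypothetical stand-ins for calls to a language model, and the two principles are illustrative, not Anthropic's actual constitution.

```python
# Minimal sketch of a Constitutional AI-style self-critique loop.
# In a real system each function below would be a language-model call.

CONSTITUTION = [
    "Choose the response that is most helpful and honest.",
    "Avoid responses that could encourage harmful behaviour.",
]

def generate(prompt):
    # Stand-in for the model's initial draft.
    return f"Draft answer to: {prompt}"

def critique(response, principle):
    # Stand-in: the model is asked whether the response violates the principle.
    return f"Checked against '{principle}'."

def revise(response, critiques):
    # Stand-in: the model rewrites the response in light of the critiques.
    return response + " (revised after self-critique)"

def constitutional_refine(prompt):
    response = generate(prompt)
    critiques = [critique(response, p) for p in CONSTITUTION]
    return revise(response, critiques)

print(constitutional_refine("Explain photosynthesis"))
```

The key design point is that the principles steer the model's own critique of its output, rather than relying on human labels for every response.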
Apply this in your prompts
PromptITIN automatically uses techniques like AI Alignment to build better prompts for you.