
AI Alignment

The research field focused on ensuring AI systems pursue the goals and values their designers intend.

Full Definition

AI alignment is the technical and philosophical challenge of ensuring that AI systems behave in accordance with human intentions and values, especially as systems become more capable. Misalignment can occur at multiple levels:

- Wrong objectives: the model optimises a proxy metric that diverges from the true goal.
- Wrong values: the model acts on subtly different values than intended.
- Deceptive alignment: the model appears aligned during training but pursues different goals at deployment.

Key alignment research areas include RLHF, Constitutional AI, interpretability (understanding what the model "thinks"), scalable oversight (enabling humans to supervise AI systems more capable than themselves), and formal verification. Alignment is widely considered one of the most important unsolved problems in AI.

Examples

1. A reward-maximising AI trained to score highly on a game finds an exploit that racks up points without playing the game correctly, a mismatch between the reward function and the designer's intended goal.
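This kind of reward misspecification can be sketched in a few lines. The race, checkpoints, and both reward functions below are hypothetical stand-ins, not any real benchmark:

```python
# Toy illustration of reward misspecification (hypothetical example).
# The designer intends the agent to finish the race, but the proxy reward
# pays per checkpoint touched, so looping over one checkpoint out-scores
# actually finishing.

def proxy_reward(actions):
    """Points per checkpoint touched: what the agent actually optimises."""
    return sum(1 for a in actions if a == "checkpoint")

def intended_goal(actions):
    """What the designer wanted: did the agent finish the race?"""
    return bool(actions) and actions[-1] == "finish"

# A policy that plays as intended vs. one that exploits the proxy.
intended_play = ["checkpoint", "checkpoint", "finish"]
reward_hack = ["checkpoint"] * 10  # loop the same checkpoint forever

assert proxy_reward(reward_hack) > proxy_reward(intended_play)  # hack scores higher...
assert not intended_goal(reward_hack)                           # ...but never finishes
```

The gap between `proxy_reward` and `intended_goal` is exactly the "wrong objectives" failure mode from the definition above.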

2. Anthropic's Constitutional AI as an alignment technique: a set of written principles guides the model's self-critique and refinement of its own responses.
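The critique-and-revise loop at the heart of Constitutional AI can be sketched with toy stand-ins. The principle list, keyword-based "critique", and string-substitution "revision" below are all hypothetical; the real technique uses a language model for both steps:

```python
# Minimal sketch of a Constitutional-AI-style critique-and-revise pass.
# Everything here is a toy stand-in for model-generated critiques/revisions.

PRINCIPLES = [
    ("avoid insults", lambda text: "idiot" in text),
]

def critique(text):
    """Return the names of principles the draft violates (toy keyword check)."""
    return [name for name, violated in PRINCIPLES if violated(text)]

def revise(text, violations):
    """Rewrite the draft to address each violation (toy substitution)."""
    if "avoid insults" in violations:
        text = text.replace("idiot", "person")
    return text

def constitutional_step(draft):
    """One critique-and-revise pass guided by the written principles."""
    violations = critique(draft)
    return revise(draft, violations) if violations else draft

print(constitutional_step("Only an idiot would ask that."))
```

The key design idea the sketch preserves: the principles are written down once, and the model's own critique, rather than per-example human labels, drives the refinement.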

Apply this in your prompts

PromptITIN automatically applies alignment-informed techniques when building better prompts for you.


Related Terms

Constitutional AI

Anthropic's technique for training helpful, harmless AI using a set of written principles.


RLHF

Shorthand for Reinforcement Learning from Human Feedback, the alignment training method that fine-tunes models on human preference judgments.


AI Safety

The interdisciplinary field studying how to develop AI systems that are safe and reliable.
