Safety

Constitutional AI

Anthropic's technique for training helpful, harmless AI using a set of written principles to guide AI-generated training feedback.

Full Definition

Constitutional AI (CAI), developed by Anthropic, trains AI models to be helpful and harmless using a written 'constitution', a set of principles about safe and beneficial behaviour, rather than relying entirely on human-rated examples. Training has two phases. In the supervised learning phase, the model critiques and revises its own harmful responses under the guidance of the constitution, and is then fine-tuned on the revised responses. In the reinforcement learning phase, a preference model is trained on AI-generated preference data (AI Feedback rather than Human Feedback) and used as the reward signal to align the model further. CAI reduces the human labelling bottleneck in safety training and makes the model's intended values more explicit, auditable, and adjustable, because they are written down as principles. It is a core part of how Anthropic trains Claude.
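
The two phases described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: the prompt wording, the function names (critique_and_revise, label_preference), the PreferencePair type, and the toy_model stand-in are all hypothetical, and the actual fine-tuning and reinforcement learning steps are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: a "model" is any function from a text prompt to a text completion.
Model = Callable[[str], str]


@dataclass
class PreferencePair:
    """One AI-labelled comparison, usable as preference-model training data."""
    prompt: str
    chosen: str
    rejected: str


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> Tuple[str, str]:
    """Phase 1 sketch: draft a response, then critique and rewrite it once per principle.

    Returns (prompt, revised_response), the kind of pair a supervised
    fine-tuning step would train on.
    """
    response = model(f"User: {prompt}\nAssistant:")
    for principle in principles:
        critique = model(
            f"Response: {response}\n"
            f"Critique this response using the principle: {principle}"
        )
        response = model(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response


def label_preference(judge: Model, prompt: str, a: str, b: str, principle: str) -> PreferencePair:
    """Phase 2 sketch: an AI judge, not a human, picks the better of two responses."""
    verdict = judge(
        f"Prompt: {prompt}\n(A) {a}\n(B) {b}\n{principle}\nAnswer with A or B."
    )
    if verdict.strip().upper().startswith("A"):
        return PreferencePair(prompt, chosen=a, rejected=b)
    return PreferencePair(prompt, chosen=b, rejected=a)


def toy_model(text: str) -> str:
    """Hard-coded stand-in for a real language model, for illustration only."""
    if "Answer with A or B" in text:
        return "B"
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Critique this response" in text:
        return "The response could enable harm."
    return "Sure, here is how to do it."


if __name__ == "__main__":
    principle = ("Choose the response that is least likely to contain "
                 "harmful or unethical content.")
    prompt, revised = critique_and_revise(toy_model, "How do I pick a lock?", [principle])
    pair = label_preference(toy_model, prompt, "Sure, here is how to do it.", revised, principle)
    print(pair)
```

In a real pipeline the revised responses would feed a supervised fine-tuning run, and many PreferencePair records would train the preference model that supplies the reward during reinforcement learning.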

Examples

Claude being trained to critique responses that are 'harmful or dishonest' and rewrite them according to the principle 'Choose the response that is least likely to contain harmful or unethical content.'

Anthropic using CAI to scale harmlessness training without requiring human labellers to be exposed to large volumes of harmful content.

Related Terms

AI Alignment

The research field focused on ensuring AI systems pursue goals and values intend…


RLHF

Shorthand for Reinforcement Learning from Human Feedback — the alignment trainin…


AI Safety

The interdisciplinary field studying how to develop AI systems that are safe, re…

