Constitutional AI
Anthropic's technique for training helpful, harmless AI, in which a set of written principles guides the model's self-critique and the AI-generated feedback used for training.
Full Definition
Constitutional AI (CAI), developed by Anthropic, trains AI models to be helpful and harmless using a written 'constitution', a set of principles about safe and beneficial behaviour, rather than relying entirely on human-rated examples. Training has two phases. In the supervised phase, the model critiques and revises its own harmful responses under the guidance of the constitution and is then fine-tuned on the revised responses. In the reinforcement learning phase, an AI model compares pairs of responses against constitutional principles, a preference model is trained on those AI-generated comparisons (AI Feedback, rather than Human Feedback), and the model is further aligned by optimising it against that preference model. CAI reduces the human labelling bottleneck in safety training and makes the AI's values more explicit, auditable, and adjustable. Anthropic uses it in training its Claude models.
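The supervised phase described above can be sketched as a small critique-and-revise loop. This is a minimal illustration, not Anthropic's implementation: `Model`, `critique_and_revise`, `toy_model`, and the prompt wording are all hypothetical names and text chosen for the example, and the "model" is a hard-coded stand-in so the code runs without any external service.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a language model: prompt text in, generated text out.
Model = Callable[[str], str]

REVISION_INSTRUCTION = "Rewrite the response to address the critique."


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> Dict[str, str]:
    """Build one supervised fine-tuning example by self-critique and revision."""
    response = model(prompt)  # initial, possibly harmful draft
    for principle in principles:
        # Ask the model to critique its own draft against one principle.
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}"
        )
        # Ask the model to revise the draft in light of that critique.
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            f"{REVISION_INSTRUCTION}"
        )
    # The final revision becomes the fine-tuning target for the original prompt.
    return {"prompt": prompt, "completion": response}


def toy_model(text: str) -> str:
    """Hard-coded fake model (assumption for illustration only)."""
    if text.rstrip().endswith(REVISION_INSTRUCTION):
        return "I can't help with that, but I can suggest a safer alternative."
    if "Critique the response" in text:
        return "The response gives unsafe instructions."
    return "Sure, here is how to do the unsafe thing."


example = critique_and_revise(
    toy_model,
    "How do I pick a lock?",
    ["Identify ways the response is harmful or dishonest."],
)
print(example)
```

In a real pipeline the collected `prompt`/`completion` pairs would be used to fine-tune the model; here they are only printed.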
Examples
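Claude being trained, in the supervised phase, to critique responses that are 'harmful or dishonest' and rewrite them, and, in the reinforcement learning phase, having pairs of responses compared under a principle such as 'Choose the response that is least likely to contain harmful or unethical content.'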
Anthropic using CAI to scale harmlessness training without requiring human labellers to be exposed to large volumes of harmful content.
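The comparison-style principle quoted in the examples above lends itself to a short sketch of AI-feedback labelling, where a judge model rather than a human labeller picks the preferred response. Everything named here (`Judge`, `label_pair`, `toy_judge`, the prompt layout, and the hard A/B answer format) is a hypothetical simplification for illustration; the keyword-based judge only stands in for a real model.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a judge model: question text in, "A" or "B" out.
Judge = Callable[[str], str]

PRINCIPLE = (
    "Choose the response that is least likely to contain "
    "harmful or unethical content."
)


def label_pair(
    judge: Judge,
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str = PRINCIPLE,
) -> Dict[str, str]:
    """Ask the judge which response better follows the principle."""
    question = (
        f"Prompt: {prompt}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        f"{principle}\nAnswer with A or B."
    )
    answer = judge(question).strip().upper()
    if answer == "A":
        chosen, rejected = response_a, response_b
    elif answer == "B":
        chosen, rejected = response_b, response_a
    else:
        raise ValueError(f"Unexpected judge answer: {answer!r}")
    # One preference record; many of these would train a preference model.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_judge(question: str) -> str:
    """Fake judge (assumption): rejects option A if it mentions 'unsafe'."""
    for line in question.splitlines():
        if line.startswith("(A)"):
            return "B" if "unsafe" in line.lower() else "A"
    return "A"


record = label_pair(
    toy_judge,
    "How do I pick a lock?",
    "Sure, here is the unsafe way to do it.",
    "I can't help with that.",
)
print(record)
```

Because the labels come from a model, no human labeller has to read the harmful candidate responses, which is the scaling benefit the second example describes.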