Training

Reinforcement Learning from Human Feedback

A training technique that uses human preference ratings to align model outputs with human values.

Full Definition

RLHF trains a reward model from human preference comparisons (which of two responses is better?), then uses reinforcement learning — typically PPO (Proximal Policy Optimization) — to fine-tune the language model to maximise the learned reward. This process aligns the model with nuanced human values (helpfulness, honesty, harmlessness) that are difficult to specify as explicit rules. InstructGPT (2022) demonstrated that RLHF-trained models are preferred by humans over much larger purely supervised models. The technique is computationally expensive and sensitive to the quality of the human preference data and the reward model. It is used in the production training of GPT-4, Claude, and Gemini.
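The reward-model stage can be made concrete with a small sketch. Under the Bradley-Terry assumption commonly used for preference data, the model is trained so the preferred response out-scores the rejected one, with per-pair loss −log σ(r_chosen − r_rejected). The function name below is illustrative, not from any particular library:

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Near zero when the preferred response already scores much higher;
    log(2) when the reward model cannot tell the two apart."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A confident, correct reward model is barely penalised...
print(round(pairwise_reward_loss(2.0, -2.0), 4))
# ...while a tie costs log(2) ≈ 0.6931.
print(round(pairwise_reward_loss(1.0, 1.0), 4))
```

Summing this loss over many human-labelled comparisons is what turns raw "A is better than B" judgments into a scalar reward signal the RL stage can optimise.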

Examples

1. Human raters comparing Claude response A vs. response B on helpfulness and harmlessness, with their preferences used to train a reward model.

2. Using PPO to shift a model's output distribution toward high-reward responses as scored by the trained reward model.
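The PPO step in the second example can be sketched in miniature. In RLHF the quantity the policy maximises is typically the reward model's score minus a KL-style penalty that keeps the tuned model close to the reference model, and PPO's clipped surrogate bounds how far each update can move the policy. A toy sketch, where β and the variable names are illustrative assumptions:

```python
def shaped_reward(rm_score: float, logp_policy: float, logp_ref: float,
                  beta: float = 0.1) -> float:
    """Reward actually optimised during RLHF: the reward model's score
    minus a KL-style penalty for drifting from the reference model."""
    return rm_score - beta * (logp_policy - logp_ref)

def ppo_clipped_objective(ratio: float, advantage: float,
                          eps: float = 0.2) -> float:
    """PPO's clipped surrogate: caps how far a single update can push
    the policy, which keeps RL fine-tuning stable."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# The clip stops the policy from over-committing to one lucky sample:
print(ppo_clipped_objective(1.5, 1.0))  # ratio clipped from 1.5 to 1.2
```

The KL term is what prevents "reward hacking" from degrading fluency: without it, the policy can drift to degenerate outputs that score well under an imperfect reward model.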


Related Terms

RLHF

Shorthand for Reinforcement Learning from Human Feedback — the alignment training technique defined on this page.

DPO (Direct Preference Optimization)

A training method that aligns models to human preferences without requiring a separately trained reward model.

AI Alignment

The research field focused on ensuring AI systems pursue goals and values intended by their designers.