DPO (Direct Preference Optimization)
A training method that aligns models to human preferences without requiring a separate reward model.
Full Definition
Direct Preference Optimization is an alternative to RLHF that directly optimises a language model on preference data — pairs of (chosen, rejected) responses to a prompt — without the complexity of training a separate reward model and running reinforcement learning. DPO reframes preference learning as a classification problem with a simple cross-entropy loss, making it more stable, computationally cheaper, and easier to implement than PPO-based RLHF, while achieving alignment quality comparable to RLHF on most benchmarks. DPO has become widely adopted in open-source fine-tuning because it fits into standard supervised training pipelines with minimal modification.
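The cross-entropy formulation can be sketched in a few lines. The loss for one preference pair is -log σ(β[(log πθ(y_chosen|x) − log π_ref(y_chosen|x)) − (log πθ(y_rejected|x) − log π_ref(y_rejected|x))]), where πθ is the policy being trained, π_ref is a frozen reference model, and β controls how far the policy may drift from the reference. A minimal sketch (function and argument names are illustrative, not from any specific library; real trainers operate on batched token log-probabilities):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single (chosen, rejected) preference pair.

    Each argument is the summed log-probability of the full response
    under the policy being trained or the frozen reference model.
    """
    # Implicit reward margin: beta-scaled difference of policy-vs-reference
    # log-ratios for the chosen and rejected responses.
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Binary cross-entropy with the label "chosen is preferred":
    # -log(sigmoid(logits)).
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy equals the reference, the margin is zero and the loss is ln 2; as the policy learns to assign relatively more probability to chosen responses than the reference does, the loss falls toward zero — standard supervised-classification behaviour, with no reward model or RL loop involved.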
Examples
Fine-tuning Llama 3 on 10,000 preference pairs using DPO to make it more polite and on-topic, matching GPT-4's conversational quality at a fraction of the cost.
Aligning a medical question-answering model using DPO on clinician-ranked response pairs, without the instability of PPO training.
Apply this in your prompts
PromptITIN automatically uses techniques like DPO (Direct Preference Optimization) to build better prompts for you.
Related Terms
Reinforcement Learning from Human Feedback — A training technique that uses human preference ratings to align model outputs w…
RLHF — Shorthand for Reinforcement Learning from Human Feedback — the alignment trainin…
Instruction Tuning — Supervised fine-tuning on diverse instruction-response pairs to improve a model'…