DPO (Direct Preference Optimization)
A training method that aligns models to human preferences without requiring a separate reward model.
Full Definition
Direct Preference Optimization is an alternative to RLHF that directly optimises a language model on preference data — pairs of (chosen, rejected) responses to a prompt — without the complexity of training a separate reward model and running reinforcement learning. DPO reframes preference learning as a classification problem with a simple cross-entropy loss, making it more stable, computationally cheaper, and easier to implement than PPO-based RLHF, while achieving alignment quality comparable to RLHF on most benchmarks. DPO has become widely adopted in open-source fine-tuning because it fits into standard supervised training pipelines with minimal modification.
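The cross-entropy formulation can be sketched in a few lines. The loss for one preference pair is -log σ(β[(log πθ(y_chosen|x) − log π_ref(y_chosen|x)) − (log πθ(y_rejected|x) − log π_ref(y_rejected|x))]), where πθ is the policy being trained, π_ref is a frozen reference model, and β controls how far the policy may drift from the reference. A minimal sketch (function and argument names are illustrative, not from any specific library; real trainers operate on batched token log-probabilities):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single (chosen, rejected) preference pair.

    Each argument is the summed log-probability of the full response
    under the policy being trained or the frozen reference model.
    """
    # Implicit reward margin: beta-scaled difference of policy-vs-reference
    # log-ratios for the chosen and rejected responses.
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Binary cross-entropy with the label "chosen is preferred":
    # -log(sigmoid(logits)).
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy equals the reference, the margin is zero and the loss is ln 2; as the policy learns to assign relatively more probability to chosen responses than the reference does, the loss falls toward zero — standard supervised-classification behaviour, with no reward model or RL loop involved.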
Examples
Fine-tuning Llama 3 on 10,000 preference pairs using DPO to make it more polite and on-topic, matching GPT-4's conversational quality at a fraction of the cost.
Aligning a medical question-answering model using DPO on clinician-ranked response pairs, without the instability of PPO training.
Apply this in your prompts
PromptITIN automatically uses techniques like DPO (Direct Preference Optimization) to build better prompts for you.
Related Terms
Reinforcement Learning from Human Feedback — A training technique that uses human preference ratings to align model outputs w…
RLHF — Shorthand for Reinforcement Learning from Human Feedback — the alignment trainin…
Instruction Tuning — Supervised fine-tuning on diverse instruction-response pairs to improve a model'…