Training

Synthetic Data

Training data generated by AI models rather than collected from human-created sources.

Full Definition

Synthetic data is artificially generated training data, typically produced by a stronger teacher model and used to train or fine-tune a smaller student model. It is attractive because human labelling is expensive and slow, while model generation is comparatively cheap and fast. Techniques include Self-Instruct (prompting a model with a small set of seed tasks so that it generates new instruction-response pairs), Constitutional AI critique-and-revision loops, and data augmentation via paraphrasing. Risks include distributional collapse, often called model collapse (the student inherits the teacher's biases and errors, which can be amplified when models are repeatedly trained on generated data), and a lack of ground-truth novelty: a teacher cannot supply knowledge it does not already have. Despite these risks, synthetic data has helped models such as Alpaca and Phi-3 achieve strong performance at low cost.
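
A Self-Instruct-style generation loop can be sketched roughly as follows. This is a minimal illustration, not the method used by any particular project: `make_stub_teacher`, `generate_pairs`, and the prompt wording are hypothetical names invented here, the teacher is a canned stub standing in for a real model call, and word-level Jaccard similarity is used as a simple substitute for the ROUGE-L overlap filter described in the Self-Instruct paper.

```python
import random
from typing import Callable, Dict, List, Optional


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity (a simple stand-in for a ROUGE-L filter)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def make_stub_teacher() -> Callable[[str], str]:
    """Hypothetical teacher: returns canned text instead of calling a real model."""
    canned = iter([
        "Summarize the paragraph in one sentence.",
        "Summarize the paragraph in one sentence.",  # duplicate, should be filtered
        "Translate the sentence into French.",
    ])

    def teacher(prompt: str) -> str:
        if prompt.startswith("Write a new task instruction"):
            return next(canned, "Summarize the paragraph in one sentence.")
        return f"[stub response to: {prompt}]"

    return teacher


def generate_pairs(
    teacher: Callable[[str], str],
    seed_instructions: List[str],
    rounds: int,
    max_similarity: float = 0.7,
    rng: Optional[random.Random] = None,
) -> List[Dict[str, str]]:
    """Grow a pool of instructions and collect instruction-response pairs."""
    rng = rng or random.Random(0)
    pool = list(seed_instructions)
    pairs: List[Dict[str, str]] = []
    for _ in range(rounds):
        # Show the teacher a few existing tasks and ask for a different one.
        examples = rng.sample(pool, k=min(2, len(pool)))
        prompt = "Write a new task instruction unlike these:\n" + "\n".join(examples)
        instruction = teacher(prompt).strip()
        # Drop empty or near-duplicate instructions to keep the pool diverse.
        if not instruction or any(
            jaccard(instruction, seen) >= max_similarity for seen in pool
        ):
            continue
        response = teacher(instruction).strip()
        pool.append(instruction)
        pairs.append({"instruction": instruction, "response": response})
    return pairs


seeds = ["Classify the sentiment of the review.", "List three synonyms for the word."]
pairs = generate_pairs(make_stub_teacher(), seeds, rounds=3)
```

In a real pipeline the stub would be replaced by a call to an actual teacher model, and the similarity filter is one of the few defences against the pool drifting toward repetitive, low-diversity tasks.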

Examples

Stanford Alpaca used OpenAI's text-davinci-003 (a GPT-3.5-series model) to generate 52,000 instruction-response pairs for under $500, then used them to fine-tune LLaMA 7B.

Microsoft's Phi-3-mini (3.8B parameters) was trained on a mix of heavily filtered web data and LLM-generated synthetic 'textbook quality' data, achieving strong reasoning performance despite its small size.
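
The distributional collapse risk mentioned in the definition can be illustrated with a toy simulation. The sketch below is only an analogy, not a model of real language-model training: each "generation" fits a Gaussian to a small sample drawn from the previous generation's fitted Gaussian, so every model is trained purely on its predecessor's synthetic output. The function name `simulate_collapse` and all parameter values are assumptions chosen for illustration.

```python
import random
import statistics
from typing import List


def simulate_collapse(
    generations: int = 200, sample_size: int = 10, seed: int = 0
) -> List[float]:
    """Return the fitted standard deviation after each generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # the "real" data distribution
    history = [sigma]
    for _ in range(generations):
        # The current model generates a small synthetic dataset...
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ...and the next model is fitted to that synthetic data only.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


history = simulate_collapse()
print(f"initial std: {history[0]:.3g}, final std: {history[-1]:.3g}")
```

With small samples the fitted spread tends to shrink from one generation to the next, so the chain gradually loses the tails of the original distribution. Mixing in fresh human-created data at each step is the usual way to counteract this.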

Related Terms

Dataset

A structured collection of data examples used to train, validate, or evaluate a …

Training Data

The corpus of examples a model learns from during its training process.…

Fine-Tuning

Continuing training of a pretrained model on a smaller, task-specific dataset to…
