Synthetic Data
Training data generated by AI models rather than collected from human-created sources.
Full Definition
Synthetic data is artificially generated training data, typically produced by a larger, more capable teacher model and used to train or fine-tune a smaller student model. It is attractive because human labelling is expensive and slow, while model generation is comparatively cheap and fast. Techniques include self-instruct (prompting a model with a small set of seed tasks so that it generates new instruction-response pairs, which are then filtered), constitutional AI critique-and-revision loops, and data augmentation via paraphrasing. Risks include distributional collapse, often called model collapse (the student inherits the teacher's biases and errors, which are amplified when models are trained on generated data over successive iterations), and a lack of ground-truth novelty, since generated data cannot add knowledge the teacher does not already have. Despite these risks, synthetic data has enabled models like Alpaca and Phi-3 to achieve strong performance at low cost.
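The self-instruct idea described above can be illustrated with a minimal sketch. It assumes a hypothetical `stub_teacher` function in place of a real model call, and uses a simple token-overlap check as a stand-in for the similarity filters real pipelines apply; the names, threshold, and canned data are illustrative only, not any project's actual implementation.

```python
import re
from typing import Callable

Pair = dict[str, str]  # {"instruction": ..., "response": ...}


def normalize(text: str) -> str:
    """Lowercase and replace punctuation runs with spaces so near-identical text compares equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between token sets (a crude stand-in for a ROUGE-style filter)."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def self_instruct(
    seed_tasks: list[Pair],
    teacher: Callable[[list[Pair]], list[Pair]],
    rounds: int,
    max_overlap: float = 0.7,  # assumed threshold, chosen for illustration
) -> list[Pair]:
    """Grow a task pool from seeds, keeping only non-empty, sufficiently novel pairs."""
    pool = list(seed_tasks)
    for _ in range(rounds):
        for candidate in teacher(pool):
            if not candidate["response"].strip():
                continue  # drop pairs with an empty response
            instruction = candidate["instruction"]
            if any(token_overlap(instruction, p["instruction"]) > max_overlap for p in pool):
                continue  # drop near-duplicates of anything already in the pool
            pool.append(candidate)
    return pool[len(seed_tasks):]  # return only the newly generated pairs


def stub_teacher(pool: list[Pair]) -> list[Pair]:
    """Hypothetical teacher. A real pipeline would prompt an LLM with examples from `pool`."""
    return [
        {"instruction": "Translate 'good morning' into French.", "response": "Bonjour."},
        {"instruction": "translate 'good morning' into french!", "response": "Bonjour."},
        {"instruction": "List three prime numbers.", "response": ""},
        {
            "instruction": "Summarize the water cycle in one sentence.",
            "response": "Water evaporates, condenses into clouds, and falls as precipitation.",
        },
    ]


seeds = [{"instruction": "Write a haiku about autumn.", "response": "Leaves drift to the ground."}]
synthetic = self_instruct(seeds, stub_teacher, rounds=2)
print(len(synthetic))  # -> 2 (the near-duplicate and the empty response are filtered out)
```

The filter step matters as much as generation: without it, repeated rounds would fill the pool with copies of the same few tasks, which is one small-scale way the loss of diversity described above can creep in.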
Examples
Stanford Alpaca using OpenAI's text-davinci-003 (a GPT-3.5-series model) to generate 52,000 instruction-response pairs for under $500, then using them to fine-tune LLaMA 7B.
Microsoft's Phi-3-mini (3.8 billion parameters) being trained on a mix of heavily filtered web data and synthetic, LLM-generated 'textbook quality' data, achieving strong reasoning despite its small size.