What Fine-Tuning Actually Changes
Fine-tuning is the process of continuing to train a pretrained language model on a new, smaller dataset — updating the model's internal weights so it learns new behaviors or specializes its existing knowledge. When you fine-tune, you're not overwriting the model's general capabilities — you're adding a layer of specialization on top of them. Think of a general-purpose model as a highly educated generalist. Fine-tuning is like putting that generalist through a six-month intensive residency in one specific domain. They emerge with deeper, faster pattern recognition for that domain, while mostly retaining their general knowledge. Critically: fine-tuning changes the model itself. Prompt engineering only changes what you say to the model. This distinction matters because it determines cost, reversibility, and when each approach makes sense.
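Mechanically, "updating the model's internal weights" is ordinary gradient descent continued on a smaller dataset. The sketch below makes that concrete with a deliberately tiny stand-in: a one-weight linear model instead of a neural network. Nothing here comes from a real training library; it only illustrates that fine-tuning starts from already-fitted weights and nudges them toward the new data's pattern rather than learning from scratch.

```python
# Toy illustration of pretraining vs. fine-tuning as continued gradient
# descent. A single weight stands in for the model; the function names
# are made up for this sketch.

def train(weight, data, lr=0.01, steps=200):
    """Gradient descent on mean squared error for the model y = weight * x."""
    for _ in range(steps):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight

pretrain_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # general pattern: y = 2x
finetune_data = [(1.0, 2.2), (2.0, 4.4)]               # specialized pattern: y = 2.2x

w = train(0.0, pretrain_data)              # "pretraining": w converges near 2.0
w = train(w, finetune_data, steps=50)      # "fine-tuning": w shifts toward 2.2
```

Note that the short fine-tuning run moves the weight toward the specialized pattern without discarding what pretraining learned, which is the "layer of specialization on top" described above.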
Fine-Tuning vs. Prompt Engineering: The Key Differences
Prompt engineering and fine-tuning both shape model behavior, but they operate at completely different layers. Prompt engineering is free, instant, reversible, and available to anyone: you iterate until you get good outputs, then save and reuse the prompt. Fine-tuning requires a labeled dataset, compute time, and money, and once a model is fine-tuned, you can't easily undo or adjust its behavior without retraining. Prompt engineering covers a surprisingly large surface area: you can set detailed personas, specify complex formats, inject domain knowledge as context, and constrain behavior extensively within a single prompt. Fine-tuning is the better fit when the task requires consistently specific outputs that would otherwise need an extremely long, brittle prompt; when the domain has specialized vocabulary or style that isn't well represented in the base model's training data; or when you need low-latency inference, since a fine-tuned model gets by with a much shorter prompt.
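The "different layers" point can be shown side by side: prompt engineering changes the request you send, while fine-tuning changes which model you send it to. The payload shape below mirrors common chat-completion APIs, but the model names, fields, and instructions are illustrative assumptions, not any specific vendor's API.

```python
# Same task, two approaches. Behavior lives in the prompt on the left path,
# and in the model's weights on the right path. All names are hypothetical.

LONG_INSTRUCTIONS = (
    "You are a support-ticket classifier. Respond with exactly one of: "
    "BILLING, BUG, FEATURE_REQUEST, OTHER. Use BILLING for invoices and "
    "charges... (imagine several hundred more tokens of rules and examples)"
)

prompt_engineered = {
    "model": "general-base-model",                         # unchanged base model
    "messages": [
        {"role": "system", "content": LONG_INSTRUCTIONS},  # behavior in the prompt
        {"role": "user", "content": "I was charged twice this month."},
    ],
}

fine_tuned = {
    "model": "ft:general-base-model:acme-tickets-v1",      # behavior in the weights
    "messages": [
        {"role": "user", "content": "I was charged twice this month."},  # short prompt
    ],
}
```

Every request on the prompt-engineered path pays for the long instructions again; the fine-tuned path carries them implicitly, which is where the latency and token-cost difference comes from.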
What Fine-Tuning Is (and Isn't) Good For
Fine-tuning excels at teaching style and format. If you want a model to always respond in your company's exact brand voice, use your product's specific terminology, or output structured data in a precise schema — these are learnable patterns that fine-tuning can bake in. Fine-tuning is not good at teaching factual knowledge. Contrary to popular belief, you cannot simply give a model a dataset of your company's docs and expect it to know everything in them. Fine-tuning teaches patterns and style; it's surprisingly unreliable at memorizing and accurately retrieving specific facts. For knowledge retrieval from documents, RAG (Retrieval-Augmented Generation) is almost always the better choice. Fine-tuning is also not a solution for reducing hallucinations — a fine-tuned model can hallucinate with the same confidence as the base model.
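Here is what "learnable patterns that fine-tuning can bake in" looks like as data: every training example demonstrates the same output schema, so the structure is what gets learned, not any individual fact inside it. The chat-style message layout below is a common shape for fine-tuning data, but the field names and schema are assumptions made up for illustration.

```python
import json

# Format-teaching training data: every assistant turn follows the same
# JSON schema. The schema and ticket texts are invented for this sketch.
examples = [
    {"messages": [
        {"role": "user", "content": "Ticket: My invoice shows a double charge."},
        {"role": "assistant", "content": json.dumps(
            {"category": "billing", "sentiment": "negative",
             "summary": "Customer reports a duplicate charge."})},
    ]},
    {"messages": [
        {"role": "user", "content": "Ticket: Love the new dashboard, great work!"},
        {"role": "assistant", "content": json.dumps(
            {"category": "feedback", "sentiment": "positive",
             "summary": "Customer praises the new dashboard."})},
    ]},
]

# JSONL (one JSON object per line) is a common interchange format for
# fine-tuning datasets.
jsonl = "\n".join(json.dumps(ex) for ex in examples)
```

The consistency across examples is the point: the model learns "always emit this schema," which is exactly the kind of behavior that is hard to teach reliably through facts-in-a-prompt alone.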
The Data Requirements for Fine-Tuning
Fine-tuning requires labeled training examples: pairs of inputs and the exact outputs you want the model to produce for those inputs. The quality of these examples matters far more than the quantity — a fine-tune on 500 high-quality, diverse examples will outperform one on 5,000 inconsistent ones. Creating good fine-tuning data is expensive: it requires a human (usually a domain expert) to write examples, review them for quality, and ensure diversity across the cases you care about. For most organizations, this data creation work costs more than the compute for training. If you don't have the budget to create excellent training data, prompt engineering will produce better results — because a mediocre fine-tune can actually make outputs worse by reinforcing inconsistent patterns.
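Because inconsistent examples actively hurt, a cheap automated quality pass is worth running before any human review. The sketch below catches two common problems, exact duplicates (after normalization) and empty fields; the field names and the example records are illustrative assumptions, and real audits would also check label consistency and coverage.

```python
# Minimal pre-fine-tuning data audit: flag empty fields and normalized
# duplicates. "input"/"output" field names are assumptions for this sketch.

def audit(examples):
    """examples: list of {"input": str, "output": str} pairs.
    Returns (index, issue) tuples for records that need human attention."""
    issues = []
    seen = set()
    for i, ex in enumerate(examples):
        key = (ex["input"].strip().lower(), ex["output"].strip().lower())
        if not ex["input"].strip() or not ex["output"].strip():
            issues.append((i, "empty field"))
        elif key in seen:
            issues.append((i, "duplicate"))
        seen.add(key)
    return issues

data = [
    {"input": "Reset my password", "output": "ACCOUNT"},
    {"input": "reset my password", "output": "account"},  # duplicate after normalizing
    {"input": "", "output": "OTHER"},                     # empty input
]
problems = audit(data)
```

A pass like this is no substitute for expert review of content and diversity, but it keeps reviewers from spending their time on mechanical defects.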
When Fine-Tuning Is Worth the Investment
The clearest case for fine-tuning is high-volume, consistent tasks. If you're running the same type of request tens of thousands of times per day — customer support classification, structured extraction from documents, generating product descriptions in a specific format — the economics can justify fine-tuning because a fine-tuned model can do the job with a much shorter prompt, reducing token costs and latency. The second clear case is highly specialized domains: medical coding, legal contract analysis, code in a proprietary language. If the domain requires vocabulary and reasoning patterns that are underrepresented in general training data, fine-tuning on domain-specific examples can produce meaningfully better results than prompting. For everything else — one-off tasks, varied use cases, exploratory applications — prompt engineering is the right first and often only step.
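The high-volume economics can be sanity-checked with back-of-envelope arithmetic. All numbers below, token counts, request volume, and the per-token price, are illustrative assumptions, not real pricing; the point is only that at tens of thousands of daily requests, shrinking the prompt dominates the comparison.

```python
# Rough daily-cost comparison: long engineered prompt vs. short prompt to a
# fine-tuned model. Every number here is an assumption for illustration.

def daily_cost(prompt_tokens, output_tokens, requests, price_per_1k_tokens):
    """Total daily token cost at a flat (hypothetical) per-1k-token price."""
    return (prompt_tokens + output_tokens) * requests * price_per_1k_tokens / 1000

engineered = daily_cost(prompt_tokens=1200, output_tokens=150,
                        requests=50_000, price_per_1k_tokens=0.002)   # ~$135/day
fine_tuned = daily_cost(prompt_tokens=80, output_tokens=150,
                        requests=50_000, price_per_1k_tokens=0.002)   # ~$23/day
```

Whether the gap pays for data creation and training depends on the real prices and volumes, but this is the calculation to run; at low volume the same arithmetic usually favors sticking with the prompt.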
A Practical Decision Framework
Before fine-tuning, ask:
- Have I exhausted prompt engineering? A well-structured prompt with role, context, task, constraints, and format can achieve better results than a poorly executed fine-tune.
- Could RAG solve this instead? If the use case requires factual recall from documents, RAG is cheaper and more accurate.
- Do I have 500+ high-quality labeled examples? Without good data, fine-tuning will disappoint.
- Is this task high-volume and consistent? If you'll run it millions of times with the same structure, fine-tuning economics make sense.
- Is the domain specialized enough that a general model underperforms? If yes, fine-tuning may provide a meaningful ceiling lift.
If the answers are no to most of these, keep iterating on your prompt.
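The framework above is just a "mostly yes" heuristic, and it can be made explicit as a tiny scoring function. The question names and the majority threshold below are this sketch's own encoding of the checklist, not a formal rule.

```python
# The decision framework as code: answer each question, recommend
# fine-tuning only if most answers are yes. Names are illustrative.

QUESTIONS = [
    "exhausted_prompt_engineering",
    "rag_cannot_solve_it",              # i.e. not a factual-recall problem
    "have_500_plus_quality_examples",
    "high_volume_and_consistent",
    "domain_is_specialized",
]

def recommend(answers):
    """answers: dict mapping each question name to True/False."""
    yes_count = sum(answers[q] for q in QUESTIONS)
    if yes_count > len(QUESTIONS) / 2:
        return "consider fine-tuning"
    return "keep iterating on the prompt"

recommend({q: False for q in QUESTIONS})  # → "keep iterating on the prompt"
```

Treating the threshold as a majority rather than a strict all-yes keeps the heuristic honest: a strong high-volume, specialized use case with good data can justify fine-tuning even if one question is borderline.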