When to Fine-tune
Fine-tuning means continuing training on a base model with your own labeled examples. The result is a new set of weights (or a LoRA adapter) that the model loads at inference time. It can shift style, specialize on a vocabulary, or bake in a structured output format.
When fine-tuning actually helps
- Consistent structured output. You need strict JSON schemas every time; few-shot prompting works but drifts. A few thousand labeled examples lock it in.
- Style matching. Producing tone-consistent customer emails or brand copy across thousands of generations.
- Narrow classification. Your task is picking one of 200 labels; a tuned small model beats a prompted large model on both cost and accuracy.
- Latency-sensitive paths. A tuned 8B model running locally is 5-10x faster than a hosted 70B.
When to NOT fine-tune
- Knowledge injection. Fine-tuning is bad at "the model should know about our Q3 launch." Use RAG.
- Early-stage apps. You don't know what the model needs to learn until you have real traffic and failed cases.
- Changing requirements. Every product pivot means a new training run. Prompts are version-controlled with your code; fine-tunes are not.
- Alignment tasks. Base instruction-following from Claude/GPT/Gemini is better than anything you'll get from tuning a small open model.
The typical workflow
- Collect 500-5000 (input, ideal output) pairs. Quality beats quantity hard.
- Split train/eval 90/10.
- Run a LoRA fine-tune on a smaller open model first — cheap sanity check.
- Measure against the eval set AND against the prompted baseline. If the prompt wins, stop.
- If you are committed, run a full fine-tune on the provider (OpenAI, Together, Fireworks).
Providers
- OpenAI — hosted fine-tune of GPT-4.1 family, easy API, no infra.
- Together AI — full and LoRA tuning on many open models, fast turnaround.
- Fireworks AI — similar posture, focused on inference speed post-tune.
Failure modes
- Catastrophic forgetting. Tuned model loses general capability — now it can classify but can't write a normal email. Use LoRA or lower learning rates.
- Overfitting. 100 training examples, 99% train accuracy, real accuracy drops. Watch eval loss, not train loss.
- Locked-in mistakes. You trained on a dataset with a systematic error; every output reflects it forever. Re-check labels.
Rule of thumb: try three rounds of prompt iteration and a RAG pass before considering a fine-tune.