AIStackWatch
Back to wiki

When to Fine-tune

Fine-tuning means continuing training on a base model with your own labeled examples. The result is a new set of weights (or a LoRA adapter) that the model loads at inference time. It can shift style, specialize on a vocabulary, or bake in a structured output format.

When fine-tuning actually helps

  • Consistent structured output. You need strict JSON schemas every time; few-shot prompting works but drifts. A few thousand labeled examples lock it in.
  • Style matching. Producing tone-consistent customer emails or brand copy across thousands of generations.
  • Narrow classification. Your task is picking one of 200 labels; a tuned small model beats a prompted large model on both cost and accuracy.
  • Latency-sensitive paths. A tuned 8B model running locally is 5-10x faster than a hosted 70B.

When to NOT fine-tune

  • Knowledge injection. Fine-tuning is bad at "the model should know about our Q3 launch." Use RAG.
  • Early-stage apps. You don't know what the model needs to learn until you have real traffic and failed cases.
  • Changing requirements. Every product pivot means a new training run. Prompts are version-controlled with your code; fine-tunes are not.
  • Alignment tasks. Base instruction-following from Claude/GPT/Gemini is better than anything you'll get from tuning a small open model.

The typical workflow

  1. Collect 500-5000 (input, ideal output) pairs. Quality beats quantity hard.
  2. Split train/eval 90/10.
  3. Run a LoRA fine-tune on a smaller open model first — cheap sanity check.
  4. Measure against the eval set AND against the prompted baseline. If the prompt wins, stop.
  5. If you are committed, run a full fine-tune on the provider (OpenAI, Together, Fireworks).

Providers

  • OpenAI — hosted fine-tune of GPT-4.1 family, easy API, no infra.
  • Together AI — full and LoRA tuning on many open models, fast turnaround.
  • Fireworks AI — similar posture, focused on inference speed post-tune.

Failure modes

  • Catastrophic forgetting. Tuned model loses general capability — now it can classify but can't write a normal email. Use LoRA or lower learning rates.
  • Overfitting. 100 training examples, 99% train accuracy, real accuracy drops. Watch eval loss, not train loss.
  • Locked-in mistakes. You trained on a dataset with a systematic error; every output reflects it forever. Re-check labels.

Rule of thumb: try three rounds of prompt iteration and a RAG pass before considering a fine-tune.