When to Fine-tune

Fine-tuning means continuing to train a base model on your own labeled examples. The result is a new set of weights (or a LoRA adapter) that is loaded at inference time. It can shift style, specialize the model on a vocabulary, or bake in a structured output format.

When fine-tuning actually helps

Consistent structured output. You need strict JSON schemas every time; few-shot prompting works but drifts. A few thousand labeled examples lock it in.
Style matching. Producing tone-consistent customer emails or brand copy across thousands of generations.
Narrow classification. Your task is picking one of 200 labels; a tuned small model can beat a prompted large model on both cost and accuracy.
Latency-sensitive paths. A tuned 8B model running locally can be 5-10x faster than a hosted 70B, depending on your hardware and network overhead.
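To judge whether few-shot drift on structured output is bad enough to justify tuning, it can help to measure how often outputs actually satisfy the schema. The following is a minimal sketch, assuming "schema" just means a JSON object with a set of required keys. The names is_valid and validity_rate are hypothetical, and a real pipeline would likely use a full JSON Schema validator instead.

```python
import json


def is_valid(raw: str, required_keys: set[str]) -> bool:
    """Return True if raw parses as a JSON object containing every required key."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    # Assumption: "valid" means a top-level object with all required keys present.
    return isinstance(obj, dict) and required_keys.issubset(obj)


def validity_rate(outputs: list[str], required_keys: set[str]) -> float:
    """Fraction of model outputs that pass is_valid (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(is_valid(o, required_keys) for o in outputs) / len(outputs)
```

Running this over a sample of prompted outputs gives a baseline; if the rate is already close to 1.0, the structured-output case for fine-tuning is weaker.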

When NOT to fine-tune

Knowledge injection. Fine-tuning is a poor way to teach facts such as "the model should know about our Q3 launch." Use retrieval-augmented generation (RAG) instead.
Early-stage apps. You don't know what the model needs to learn until you have real traffic and failed cases.
Changing requirements. Every product pivot means a new training run. Prompts are versioned alongside your code; fine-tuned weights usually live outside the repository and have to be retrained to change.
Alignment tasks. Out-of-the-box instruction-following from Claude/GPT/Gemini is generally better than what you'll get from tuning a small open model.

The typical workflow

Collect 500-5000 (input, ideal output) pairs. Quality beats quantity hard.
Split train/eval 90/10.
Run a LoRA fine-tune on a smaller open model first as a cheap sanity check.
Measure against the eval set AND against the prompted baseline. If the prompt wins, stop.
If the tuned model wins and you are committed, run the full fine-tune with a hosted provider (OpenAI, Together AI, Fireworks AI).
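The split and the "if the prompt wins, stop" decision in the steps above can be sketched as follows. This is an illustrative outline only: split_train_eval, accuracy, and should_continue are hypothetical helpers, exact-match accuracy is an assumed metric, and the min_gain margin is an arbitrary placeholder you would choose for your own task.

```python
import random


def split_train_eval(pairs, eval_fraction=0.1, seed=0):
    """Shuffle deterministically, then hold out eval_fraction of the pairs."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]  # (train, eval)


def accuracy(predict, eval_pairs):
    """Exact-match accuracy of predict(input) against the ideal outputs."""
    hits = sum(predict(x) == y for x, y in eval_pairs)
    return hits / len(eval_pairs)


def should_continue(tuned_acc, baseline_acc, min_gain=0.02):
    """Proceed only if the tuned model beats the prompted baseline by min_gain.

    min_gain is an assumed threshold, not a recommended value.
    """
    return tuned_acc - baseline_acc >= min_gain
```

Both the tuned model and the prompted baseline should be scored with the same accuracy function on the same held-out pairs, so the comparison is like for like.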

Providers

OpenAI: hosted fine-tuning for the GPT-4.1 family, easy API, no infrastructure to run.
Together AI: full and LoRA tuning on many open models, fast turnaround.
Fireworks AI: a similar offering, focused on inference speed after tuning.

Failure modes

Catastrophic forgetting. The tuned model loses general capability: it can now classify but can't write a normal email. Mitigate with LoRA or a lower learning rate.
Overfitting. With only 100 training examples you can reach 99% train accuracy while accuracy on real traffic drops. Watch eval loss, not train loss.
Locked-in mistakes. You trained on a dataset with a systematic error; every output reflects it until you fix the data and retrain. Re-check labels before training.
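Two of the failure modes above lend themselves to simple checks. The sketch below assumes you log one eval loss per checkpoint and that your dataset is a list of hashable (input, output) pairs. best_checkpoint and conflicting_labels are hypothetical names, and the patience value is an arbitrary default. The label check only catches one kind of error (the same input labeled two ways), not every systematic mistake.

```python
from collections import defaultdict


def best_checkpoint(eval_losses, patience=2):
    """Index of the lowest eval loss seen before `patience` checks in a row
    fail to improve on it (a basic early-stopping rule)."""
    if not eval_losses:
        raise ValueError("eval_losses must not be empty")
    best_idx, best_loss, bad_checks = 0, float("inf"), 0
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_idx, best_loss, bad_checks = i, loss, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                break  # eval loss has stopped improving; stop training here
    return best_idx


def conflicting_labels(pairs):
    """Map each input that appears with more than one distinct output to those outputs."""
    seen = defaultdict(set)
    for x, y in pairs:
        seen[x].add(y)
    return {x: ys for x, ys in seen.items() if len(ys) > 1}
```

Any input returned by conflicting_labels is worth a manual look before a training run, since the model cannot satisfy both labels at once.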

Rule of thumb: try three rounds of prompt iteration and a RAG pass before considering a fine-tune.