Inference Providers Compared

An inference provider hosts open-weight LLMs (Llama, DeepSeek, Qwen, Mistral, and their fine-tunes) and exposes them through an OpenAI-compatible API. You get elastic capacity, pay per token, and avoid the capex and ops of GPU fleets.

Why use one instead of a frontier API

Open weights. You can move to a different provider — or to your own metal — if prices or latency change.
Cheaper for narrow tasks. An 8B or 70B open model fine-tuned for your classification job often beats a frontier-priced call.
Speed. Speculative decoding and custom kernels from Together and Fireworks frequently outpace provider-hosted equivalents.
Customization. Host your own LoRA adapter; call it through the same API.

What varies between providers

Model catalog. How fast do they ship the latest Llama/DeepSeek/Qwen release? Usually hours to days for the big two.
Per-token price. Shop around — 2-3x differences are common for the same model.
Tokens/sec and TTFT. Measured independently, not just spec-sheet numbers.
Batch endpoints. If you have async workloads, cheaper batched APIs matter a lot.
Fine-tune support. Together and Fireworks both host LoRA adapters; some competitors do not.

The main players

Together AI — broad model catalog, strong fine-tuning, competitive latency.
Fireworks AI — focus on speed and throughput; their own inference engine.
Groq, Cerebras, SambaNova — specialty silicon; extremely fast TTFT on supported models but a narrower catalog.
DeepInfra, Replicate — long tail of models; latency and reliability vary.

Failure modes

Quiet quantization. Some providers serve fp8 or int4 quantized versions without being loud about it. Output quality shifts. Check the model spec.
Cold starts. Rare or just-deployed models may have multi-second TTFT on the first request. Warm with a scheduled ping.
Rate-limit cliffs. Free tier stops working at 7 pm on launch day. Pay for a committed tier before you need it.

When NOT to use one

Frontier capability needed. GPT-5, Claude Opus, Gemini Ultra have no open equivalent at parity yet. Use them directly.
Regulated data. If SOC2/HIPAA/data-residency matters more than price, a frontier provider with a signed DPA is often simpler than vetting a newer inference company.