LLM Safety Filters
Safety filters are the provider-side mechanisms that detect and block, or restrict, outputs involving sexual abuse material, weapon synthesis, encouragement of self-harm, certain political topics, and regulated domains (medical, legal, and financial advice). They run inside the main model or in parallel with it, and they are part of the reason frontier APIs behave differently from raw base models.
Where filtering happens
- Pre-generation moderation. The provider checks the prompt against a policy classifier. If it trips hard rules, the request is rejected before inference.
- In-generation steering. The main model has been trained (RLHF and related methods) to refuse or redirect policy-violating requests.
- Post-generation scan. A second classifier reads the output; violating content is masked or replaced with a refusal.
Major frontier APIs generally combine all three. Exact policies differ: OpenAI, Anthropic, and Google each publish usage policies worth reading before launch.
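The three stages can be pictured as a thin wrapper around the model call. The following is a minimal sketch, not any provider's actual implementation: `run_pipeline`, `Result`, and the two classifier callables are hypothetical names, and in-generation steering is assumed to live inside whatever `generate` does.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REFUSAL = "I can't help with that."


@dataclass
class Result:
    text: str
    blocked_at: Optional[str]  # "pre", "post", or None if nothing fired


def run_pipeline(
    prompt: str,
    prompt_flagged: Callable[[str], bool],   # hypothetical pre-generation classifier
    generate: Callable[[str], str],          # model call; steering happens in here
    output_flagged: Callable[[str], bool],   # hypothetical post-generation classifier
) -> Result:
    # Stage 1: reject before inference if the prompt trips a hard rule.
    if prompt_flagged(prompt):
        return Result(REFUSAL, "pre")
    # Stage 2: the (already safety-trained) model produces a draft.
    draft = generate(prompt)
    # Stage 3: a second classifier reads the draft and may replace it.
    if output_flagged(draft):
        return Result(REFUSAL, "post")
    return Result(draft, None)
```

Recording `blocked_at` alongside each response makes it possible to tell later which stage produced a given refusal.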
Common legitimate false positives
- Security research. Questions about malware, exploits, or offensive techniques often trigger refusals even when the intent is clearly defensive.
- Medical and mental-health apps. Legitimate user questions get flagged as self-harm risk.
- Fiction with violence. Writing tools hit refusals on scenes no more graphic than PG-13.
- Non-English outputs. Safety classifiers tend to be English-first, so false-refusal rates are often higher in other languages, especially low-resource ones.
Plan for these. An eval slice on each sensitive category catches regressions when provider policies shift.
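One way to build such an eval slice is to tag each test prompt with its category and compare per-category refusal rates against a stored baseline. This is a rough sketch under stated assumptions: `call_model` stands in for your own client wrapper, the opener-based `is_refusal` heuristic is a placeholder for a proper detector, and the 0.05 tolerance is arbitrary.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Assumption: refusals open with one of these phrases. Real traffic usually
# needs a stronger detector, for example a small classifier.
REFUSAL_OPENERS = ("i can't", "i cannot", "i'm unable", "i won't")


def is_refusal(response: str) -> bool:
    return response.strip().lower().startswith(REFUSAL_OPENERS)


def refusal_rates(
    cases: Iterable[Tuple[str, str]],   # (category, prompt) pairs
    call_model: Callable[[str], str],   # hypothetical wrapper around the provider API
) -> Dict[str, float]:
    totals: Dict[str, int] = defaultdict(int)
    refused: Dict[str, int] = defaultdict(int)
    for category, prompt in cases:
        totals[category] += 1
        if is_refusal(call_model(prompt)):
            refused[category] += 1
    return {c: refused[c] / totals[c] for c in totals}


def regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.05,
) -> List[str]:
    # Categories whose refusal rate rose by more than `tolerance`.
    return sorted(
        c for c, rate in current.items()
        if rate - baseline.get(c, 0.0) > tolerance
    )
```

Running this on a schedule, and after any announced provider model or policy update, turns a silent policy shift into a failing check.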
Building your own guardrails
Provider filters cover their policies, not yours. You likely need additional checks:
- PII leak guard. Regex plus a classifier to catch Social Security numbers, credit card numbers, and email addresses in outputs.
- Brand guardrails. Block the model from recommending competitors or making promises about pricing.
- Jailbreak detection. A classifier on user turns that flags injection attempts for logging and rate-limiting.
- Schema enforcement. Structured output is itself a guardrail: a response validated against a schema cannot carry content outside the declared fields, though free-text fields still need their own checks.
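As an illustration of the regex half of a PII leak guard, here is a minimal sketch. The patterns are deliberately simple assumptions (US-style SSNs, 13 to 19 digit card numbers confirmed with a Luhn checksum, a loose email pattern), and `redact_pii` is a hypothetical helper; a production guard would pair it with a classifier, as the list above suggests.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; they will miss some formats and over-match others.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def luhn_ok(digits: str) -> bool:
    # Standard Luhn checksum: double every second digit from the right.
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_pii(text: str) -> Tuple[str, List[str]]:
    found = set()

    def _card(match: "re.Match[str]") -> str:
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_ok(digits):  # skip digit runs that are not plausible cards
            found.add("card")
            return "[REDACTED_CARD]"
        return match.group()

    text = CARD_RE.sub(_card, text)
    text, n_ssn = SSN_RE.subn("[REDACTED_SSN]", text)
    if n_ssn:
        found.add("ssn")
    text, n_email = EMAIL_RE.subn("[REDACTED_EMAIL]", text)
    if n_email:
        found.add("email")
    return text, sorted(found)
```

The Luhn check is there to cut false positives on order numbers and other long digit runs; the returned list of kinds is meant for logging rather than for display to the user.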
Failure modes
- Over-refusal destroys UX. Users learn to rephrase around the filter or leave. Measure refusal rate on real traffic; treat a spike like a latency spike.
- Refusal text leaks the policy. "I can't discuss firearms maintenance" is itself information: it tells the user which rule fired. A generic "I can't help with that" is safer.
- Policy drift between providers. Multi-provider routing produces inconsistent refusals for identical inputs. Layer your own filter on top so behavior stays consistent.
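Two of the mitigations above, a generic refusal message and refusal-rate alerting, fit in a small layer that sits between any provider and the user. The sketch below makes several assumptions: the phrase-based `looks_like_refusal` heuristic, the 500-response window, and the 10% threshold are placeholders to tune against your own traffic, and all names are hypothetical.

```python
from collections import deque

GENERIC_REFUSAL = "I can't help with that."
# Assumption: provider refusals open with one of these phrases.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't")


def looks_like_refusal(text: str) -> bool:
    return text.strip().lower().startswith(REFUSAL_MARKERS)


def normalize(text: str) -> str:
    # Replace provider-specific refusal wording with one generic message so
    # the reply neither names the topic nor varies by provider.
    return GENERIC_REFUSAL if looks_like_refusal(text) else text


class RefusalRateMonitor:
    """Rolling refusal rate over the last `window` responses."""

    def __init__(self, window: int = 500, threshold: float = 0.10) -> None:
        self.events = deque(maxlen=window)
        self.threshold = threshold

    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def record(self, refused: bool) -> bool:
        # Returns True when the rolling rate exceeds the alert threshold.
        self.events.append(refused)
        return self.rate() > self.threshold
```

Calling `monitor.record(looks_like_refusal(raw))` before `normalize(raw)` keeps the metric tied to what the provider actually returned, while the user only ever sees the generic text.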
When to disable provider filters
Never. Even B2B apps with sophisticated users benefit from provider safety filtering: it can catch supply-chain prompt injections, adversarial uploads, and the one angry ex-employee. Build additional controls on top; don't try to route around the defaults.