Tag
#llm-safety
9 posts tagged llm-safety.
- ops
Fine-Tuned Classifiers vs. Off-the-Shelf Moderation APIs: Cost & Tradeoffs
Off-the-shelf moderation APIs are cheap to start and expensive to outgrow. Fine-tuned classifiers are the reverse.
- guides
Llama Guard vs Llama Guard 2 vs Llama Guard 3: The Lineage, Clarified
Meta's Llama Guard series gets cited loosely, often with the wrong base model or category count. Here's the verified lineage: base models, taxonomies
- reviews
Perspective API: Good at Its Original Job, Wrong for LLM Safety
Jigsaw's Perspective API has 8+ years of production data on toxicity detection. For community content moderation it remains strong.
- ops
Content Moderation for RAG: The Retrieval Layer Is an Attack Path
RAG pipelines have a moderation problem at the retrieval layer that input/output classifiers don't address. Injected content in retrieved documents can
- ops
Classifier Ensembles for Production Content Moderation
Single classifiers have characteristic failure modes. Ensembles that combine models with different architectures and training distributions reduce
- ops
False Positive Costs in Content Moderation: How to Measure Them
False positives in content moderation drive hidden costs: user abandonment, review-queue spend, appeal load. Learn how to quantify them and calibrate
- reviews
OpenAI Moderation API Review: Strengths and Real Gaps
An honest OpenAI Moderation API review: fast (~20ms) and free with credits, strong category breadth, but predictable gaps on obfuscated text, context, and
- reviews
Llama Guard Benchmark Review: Real Performance vs. Vendor Claims
Meta's Llama Guard series has become a default choice for open-source content moderation. Benchmarks on the standard test sets look strong.
- reviews
NeMo Guardrails in Production: What It Does Well; Where It Fails
NVIDIA's NeMo Guardrails offers conversation-flow control that classifiers can't provide. The deployment complexity is real.