Content Moderation
Automated or human review of AI inputs and outputs to prevent harmful, illegal, or policy-violating content.
Full Definition
Content moderation for LLM applications involves detecting and filtering harmful content at multiple stages: input filtering (preventing harmful prompts from reaching the model), output filtering (catching harmful responses before they reach users), and post-hoc logging and review. Automated moderation uses classifier models (like OpenAI's Moderation API) trained to detect hate speech, violence, sexual content, self-harm promotion, and other categories. Human review is layered on top to handle edge cases and to improve classifier training data. Effective moderation is a balancing act: overly strict filtering blocks legitimate use, while overly lax filtering enables harm. False positive and false negative rates must be tuned to the application context.
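The input-filtering and output-filtering stages described above can be sketched as a simple pipeline. This is a minimal illustration, not a production implementation: `score_text` is a hypothetical keyword-based stub standing in for a real classifier such as OpenAI's Moderation API, and the category thresholds are illustrative values, not recommendations.

```python
# Illustrative per-category block thresholds; real values must be tuned
# to the application's tolerance for false positives vs. false negatives.
BLOCK_THRESHOLDS = {"violence": 0.5, "sexual": 0.5, "self_harm": 0.3}

def score_text(text: str) -> dict[str, float]:
    """Hypothetical classifier stub returning per-category scores in [0, 1].
    A real system would call a trained moderation model here."""
    keywords = {
        "violence": ["attack", "kill"],
        "sexual": ["explicit"],
        "self_harm": ["hurt myself"],
    }
    lowered = text.lower()
    return {cat: (1.0 if any(k in lowered for k in words) else 0.0)
            for cat, words in keywords.items()}

def violations(text: str) -> list[str]:
    """Return the categories whose score meets or exceeds the threshold."""
    scores = score_text(text)
    return [cat for cat, t in BLOCK_THRESHOLDS.items() if scores[cat] >= t]

def moderated_generate(prompt: str, generate) -> str:
    # Input filtering: block harmful prompts before they reach the model.
    if violations(prompt):
        return "Sorry, I can't help with that request."
    response = generate(prompt)
    # Output filtering: catch harmful responses before they reach the user.
    if violations(response):
        return "Sorry, I can't share that response."
    return response
```

In practice, blocked inputs and outputs would also be logged for the post-hoc human review stage, which supplies new training data for the classifier.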
Examples
Using OpenAI's Moderation API endpoint to classify every user message before passing it to the generation model, blocking messages that score above threshold on 'violence' or 'sexual' categories.
A children's educational platform adding a whitelist of allowed topics on top of a general-purpose model to prevent any off-topic content.
Apply this in your prompts
PromptITIN automatically uses techniques like Content Moderation to build better prompts for you.