Content Moderation
Automated or human review of AI inputs and outputs to prevent harmful, illegal, or policy-violating content.
Full Definition
Content moderation for LLM applications involves detecting and filtering harmful content at multiple stages: input filtering (preventing harmful prompts from reaching the model), output filtering (catching harmful responses before they reach users), and post-hoc logging and review. Automated moderation uses classifier models (like OpenAI's Moderation API) trained to detect hate speech, violence, sexual content, self-harm promotion, and other categories. Human review is layered on top to handle edge cases and to improve classifier training data. Effective moderation is a balancing act: overly strict filtering blocks legitimate use, while overly lax filtering enables harm. False positive and false negative rates must be tuned to the application context.
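The input-filtering and output-filtering stages described above can be sketched as a simple pipeline. This is a minimal illustration, not a production implementation: `score_text` is a hypothetical keyword-based stub standing in for a real classifier such as OpenAI's Moderation API, and the category thresholds are illustrative values, not recommendations.

```python
# Illustrative per-category block thresholds; real values must be tuned
# to the application's tolerance for false positives vs. false negatives.
BLOCK_THRESHOLDS = {"violence": 0.5, "sexual": 0.5, "self_harm": 0.3}

def score_text(text: str) -> dict[str, float]:
    """Hypothetical classifier stub returning per-category scores in [0, 1].
    A real system would call a trained moderation model here."""
    keywords = {
        "violence": ["attack", "kill"],
        "sexual": ["explicit"],
        "self_harm": ["hurt myself"],
    }
    lowered = text.lower()
    return {cat: (1.0 if any(k in lowered for k in words) else 0.0)
            for cat, words in keywords.items()}

def violations(text: str) -> list[str]:
    """Return the categories whose score meets or exceeds the threshold."""
    scores = score_text(text)
    return [cat for cat, t in BLOCK_THRESHOLDS.items() if scores[cat] >= t]

def moderated_generate(prompt: str, generate) -> str:
    # Input filtering: block harmful prompts before they reach the model.
    if violations(prompt):
        return "Sorry, I can't help with that request."
    response = generate(prompt)
    # Output filtering: catch harmful responses before they reach the user.
    if violations(response):
        return "Sorry, I can't share that response."
    return response
```

In practice, blocked inputs and outputs would also be logged for the post-hoc human review stage, which supplies new training data for the classifier.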
Examples
Using OpenAI's Moderation API endpoint to classify every user message before passing it to the generation model, blocking messages that score above threshold on 'violence' or 'sexual' categories.
A children's educational platform adding a whitelist of allowed topics on top of a general-purpose model to prevent any off-topic content.
Apply this in your prompts
PromptITIN automatically uses techniques like Content Moderation to build better prompts for you.