Adversarial Prompting
Crafting inputs specifically designed to cause a model to behave incorrectly or unsafely.
Full Definition
Adversarial prompting encompasses techniques used to probe the boundaries of a model's safety training, reliability, or factual accuracy through carefully crafted inputs. Unlike benign edge-case failures, adversarial prompts are intentionally designed to elicit specific undesired behaviours: bypassing content policies (jailbreaks), extracting system prompts, performing prompt injection, eliciting hallucinations, or triggering politically biased responses. Adversarial prompting is a core activity in red-teaming and safety evaluation: understanding adversarial techniques is essential both for building robust applications and for model developers designing defences. Many adversarial patterns are catalogued in public red-teaming resources such as MITRE ATLAS and PromptBench.
Examples
Constructing a 'many-shot jailbreak', where 100+ (question, harmful answer) example pairs in the context gradually normalise the model into responding to harmful requests.
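The many-shot pattern above is purely structural: a long run of in-context Q/A pairs followed by the target question. A minimal sketch of that structure, using benign placeholder pairs (the helper name and pair contents are illustrative, not from any real attack corpus):

```python
def build_many_shot_prompt(pairs, final_question):
    """Concatenate many (question, answer) shots before the target question.

    In a real many-shot jailbreak the shots would contain harmful answers;
    here they are benign placeholders to show the prompt shape only.
    """
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in pairs)
    return f"{shots}\n\nQ: {final_question}\nA:"

# 100+ repeated pairs establish a strong answering pattern in context.
pairs = [(f"Example question {i}?", f"Example answer {i}.") for i in range(128)]
prompt = build_many_shot_prompt(pairs, "Target question?")
print(prompt.count("Q:"))  # 129 question markers: 128 shots plus the target
```

Defences against this pattern typically cap effective context, detect long runs of uniform Q/A shots, or apply safety classification to the assembled prompt rather than the final question alone.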
Using Unicode lookalike characters to bypass keyword-based safety filters while maintaining human-readable meaning.
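One common defence against lookalike-character bypasses is to normalise text before keyword filtering. A minimal sketch using Python's standard `unicodedata` module (the blocklist and function names are illustrative):

```python
import unicodedata

BLOCKLIST = {"bomb"}  # toy keyword filter for illustration only


def normalise(text: str) -> str:
    """NFKC-fold compatibility characters (fullwidth forms, ligatures, etc.)."""
    return unicodedata.normalize("NFKC", text)


def is_flagged(text: str) -> bool:
    """Check the blocklist against the normalised, case-folded text."""
    folded = normalise(text).casefold()
    return any(word in folded for word in BLOCKLIST)


print(is_flagged("ｂｏｍｂ"))  # fullwidth lookalikes fold to "bomb" -> True
print(is_flagged("hello"))     # -> False
```

Note the limitation: NFKC handles compatibility lookalikes such as fullwidth forms, but not cross-script homoglyphs (e.g. Cyrillic 'о' for Latin 'o'), which require a separate confusables mapping such as the one published in Unicode TR39.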
Apply this in your prompts
PromptITIN automatically uses techniques like Adversarial Prompting to build better prompts for you.