Adversarial Prompting
Crafting inputs specifically designed to cause a model to behave incorrectly or unsafely.
Full Definition
Adversarial prompting encompasses techniques used to probe the boundaries of a model's safety training, reliability, or factual accuracy through carefully crafted inputs. Unlike benign edge-case failures, adversarial prompts are intentionally designed to elicit specific undesired behaviours: bypassing content policies (jailbreaks), extracting system prompts, performing prompt injection, eliciting hallucinations, or triggering politically biased responses. Adversarial prompting is a core activity in red-teaming and safety evaluation: understanding adversarial techniques is essential both for building robust applications and for model developers designing defences. Many adversarial patterns are catalogued in public red-teaming resources such as MITRE ATLAS and PromptBench.
Examples
Constructing a 'many-shot jailbreak', where 100+ (question, harmful answer) example pairs in the context gradually normalise the model into responding to harmful requests.
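The many-shot pattern above is purely structural: a long run of in-context Q/A pairs followed by the target question. A minimal sketch of that structure, using benign placeholder pairs (the helper name and pair contents are illustrative, not from any real attack corpus):

```python
def build_many_shot_prompt(pairs, final_question):
    """Concatenate many (question, answer) shots before the target question.

    In a real many-shot jailbreak the shots would contain harmful answers;
    here they are benign placeholders to show the prompt shape only.
    """
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in pairs)
    return f"{shots}\n\nQ: {final_question}\nA:"

# 100+ repeated pairs establish a strong answering pattern in context.
pairs = [(f"Example question {i}?", f"Example answer {i}.") for i in range(128)]
prompt = build_many_shot_prompt(pairs, "Target question?")
print(prompt.count("Q:"))  # 129 question markers: 128 shots plus the target
```

Defences against this pattern typically cap effective context, detect long runs of uniform Q/A shots, or apply safety classification to the assembled prompt rather than the final question alone.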
Using Unicode lookalike characters to bypass keyword-based safety filters while maintaining human-readable meaning.
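One common defence against lookalike-character bypasses is to normalise text before keyword filtering. A minimal sketch using Python's standard `unicodedata` module (the blocklist and function names are illustrative):

```python
import unicodedata

BLOCKLIST = {"bomb"}  # toy keyword filter for illustration only


def normalise(text: str) -> str:
    """NFKC-fold compatibility characters (fullwidth forms, ligatures, etc.)."""
    return unicodedata.normalize("NFKC", text)


def is_flagged(text: str) -> bool:
    """Check the blocklist against the normalised, case-folded text."""
    folded = normalise(text).casefold()
    return any(word in folded for word in BLOCKLIST)


print(is_flagged("ｂｏｍｂ"))  # fullwidth lookalikes fold to "bomb" -> True
print(is_flagged("hello"))     # -> False
```

Note the limitation: NFKC handles compatibility lookalikes such as fullwidth forms, but not cross-script homoglyphs (e.g. Cyrillic 'о' for Latin 'o'), which require a separate confusables mapping such as the one published in Unicode TR39.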
Apply this in your prompts
PromptITIN automatically uses techniques like Adversarial Prompting to build better prompts for you.