Why JSON Extraction Is Non-Trivial
Language models are trained to generate human-readable text, not machine-readable structured data. Left to their defaults, models will: wrap JSON in markdown code fences (breaking parsers), include explanatory text before or after the JSON (also breaking parsers), omit fields they don't have values for (creating an inconsistent schema), add fields you didn't ask for, use inconsistent string formatting for the same value type, and sometimes generate output that looks like JSON but is syntactically invalid (trailing commas, single-quoted strings, truncated objects). Each of these failure modes requires specific countermeasures in your prompt design and application code.
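On the application side, the first two failure modes (code fences and surrounding prose) can often be recovered from before parsing. The following is a minimal sketch using only the Python standard library; the function name extract_json is hypothetical, and the fallback scan is one possible heuristic rather than a complete solution.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built here to keep this example readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_json(raw: str):
    """Best-effort recovery of a JSON value from a model reply (hypothetical helper)."""
    text = raw.strip()

    # 1. If the reply contains a fenced block, keep only its contents.
    fenced = FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1).strip()

    # 2. Try to parse the remaining text as-is.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # 3. Fall back to the first position where an object or array decodes,
    #    which skips explanatory text before and after the JSON.
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(text, index)
                return value
            except json.JSONDecodeError:
                continue

    raise ValueError("no JSON object or array found in model output")
```

This kind of cleanup is a safety net, not a substitute for the prompt-level instructions; output that is truncated or otherwise malformed still raises an error and needs a retry.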
The Core JSON Prompt Pattern
A reliable JSON prompt has four elements: (1) specify the exact schema with field names and types, (2) provide a minimal example of the expected output, (3) include explicit exclusion instructions for common failure modes, (4) specify how to handle missing or unknown data. Combined into one instruction block: 'Respond with valid JSON only: no explanation, no markdown code fences, no additional text. Use this exact schema: [schema]. Example output: [example]. If a field value is unknown, use null rather than omitting the field.' This four-element formula addresses the most common failure modes in a single instruction block.
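The four elements can be assembled programmatically so that every extraction request uses the same wording. This is a small sketch, not a prescribed template: the function build_json_prompt, its parameters, and the sample contact schema are all illustrative assumptions.

```python
import json


def build_json_prompt(schema: dict, example: dict, source_text: str) -> str:
    """Assemble the four-element prompt (hypothetical helper).

    `schema` maps each field name to a short type description.
    """
    schema_lines = "\n".join(
        f'- "{name}": {type_desc}' for name, type_desc in schema.items()
    )
    return (
        # Element 3: exclusion instructions for common failure modes.
        "Respond with valid JSON only: no explanation, no markdown code fences, "
        "no additional text.\n"
        # Element 1: the exact schema.
        f"Use this exact schema:\n{schema_lines}\n"
        # Element 2: a minimal example of the expected output.
        f"Example output:\n{json.dumps(example)}\n"
        # Element 4: how to handle missing or unknown data.
        "If a field value is unknown, use null rather than omitting the field.\n\n"
        f"Text:\n{source_text}"
    )


prompt = build_json_prompt(
    schema={"name": "string", "email": "string or null"},
    example={"name": "Ada Lovelace", "email": None},
    source_text="Ada Lovelace called about her order.",
)
print(prompt)
```

Serializing the example with json.dumps means Python's None appears as null in the prompt, which matches the null instruction in the final line.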
Handling Optional Fields and Null Values
Inconsistent null handling is one of the most common sources of downstream parsing errors in JSON extraction pipelines. Models default to omitting fields they don't have values for, which breaks any code that expects a consistent schema. Fix this with explicit null instructions: 'If a field value is not present in the source text, set it to null rather than omitting the key.' For enum fields (fields with a fixed set of valid values), list the valid values explicitly and include 'if the value doesn't clearly match one of these options, use null'. This prevents the model from inventing near-matches that break enum validation.
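The same rules can be enforced in application code as a backstop for cases where the model ignores the instruction. The sketch below assumes a hypothetical three-field schema with one enum field; the field names and allowed values are made up for illustration.

```python
# Hypothetical schema: every record should have exactly these keys.
EXPECTED_FIELDS = ("name", "email", "priority")
# Allowed values for the "priority" enum field (illustrative).
PRIORITY_VALUES = ("low", "medium", "high")


def normalize_record(record: dict) -> dict:
    """Return a record with a consistent set of keys.

    Missing keys become None (JSON null), unexpected keys are dropped,
    and enum values outside the allowed set are replaced with None.
    """
    normalized = {field: record.get(field) for field in EXPECTED_FIELDS}
    if normalized["priority"] not in PRIORITY_VALUES:
        normalized["priority"] = None
    return normalized


# The model omitted "email", invented an enum value, and added a field.
raw = {"name": "Ada", "priority": "urgent", "notes": "called twice"}
print(normalize_record(raw))
# {'name': 'Ada', 'email': None, 'priority': None}
```

Silently dropping unexpected keys is a design choice; a stricter pipeline might log them or treat them as a validation failure instead.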
Schema Complexity and Nested Objects
Simple flat schemas are reliable. As schemas become more complex (nested objects, arrays of objects, conditional fields), reliability decreases. For complex schemas, provide a complete, concrete example of the expected output rather than just the schema definition, because models follow examples more reliably than abstract type specifications. For arrays of objects, show 2–3 example items in the array. For deeply nested structures, consider breaking the extraction into multiple prompts (extract each major section separately) and assembling the results in application code. This is more reliable than a single complex extraction.
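Splitting a nested extraction into per-section prompts might look like the following sketch. The section names, prompt wording, and the call_model parameter are all assumptions: call_model stands in for whatever function sends a prompt to your model and returns its text reply.

```python
import json
from typing import Callable

# One small, flat prompt per major section (wording is illustrative).
SECTION_PROMPTS = {
    "customer": (
        'Extract the customer as a JSON object with keys "name" and "email". '
        "Respond with valid JSON only. Use null for unknown values."
    ),
    "items": (
        'Extract the line items as a JSON array of objects with keys "sku" '
        'and "quantity". Respond with valid JSON only.'
    ),
}


def extract_in_sections(source_text: str, call_model: Callable[[str], str]) -> dict:
    """Run one extraction per section and assemble the nested result in code."""
    result = {}
    for section, instruction in SECTION_PROMPTS.items():
        reply = call_model(f"{instruction}\n\nText:\n{source_text}")
        # Each reply is parsed on its own, so a failure can be traced
        # (and retried) per section rather than for the whole document.
        result[section] = json.loads(reply)
    return result
```

The trade-off is one model call per section instead of one overall, in exchange for simpler schemas and failures that are isolated to a single section.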
Validation, Retry, and Error Recovery
Even with well-crafted prompts, JSON extraction fails occasionally, especially for complex schemas or ambiguous source text. Production pipelines need validation and retry logic. Validation: parse the model's output with a JSON parser and validate against your schema (jsonschema in Python, Zod in TypeScript). On validation failure: retry the request, optionally with the validation error message included in the retry prompt ('Your previous response was invalid JSON. The error was: [error]. Try again, responding with valid JSON only.'). For critical pipelines, implement a maximum retry limit with fallback to human review for persistent failures.
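A validate-and-retry loop along these lines could be sketched as follows. To stay self-contained it uses a hand-written required-keys check rather than a schema library; in practice a jsonschema validation call could take its place. The names validate, extract_with_retry, call_model, and REQUIRED_KEYS are hypothetical, and call_model stands in for your own model client.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"name", "email"}  # illustrative schema


def validate(raw: str) -> dict:
    """Parse the reply and check its shape; raise ValueError on any problem."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def extract_with_retry(
    prompt: str,
    call_model: Callable[[str], str],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Return validated data, or None once the retry limit is reached."""
    current_prompt = prompt
    for _attempt in range(max_attempts):
        reply = call_model(current_prompt)
        try:
            return validate(reply)
        except ValueError as err:
            # Feed the validation error back to the model on the next attempt.
            current_prompt = (
                f"{prompt}\n\nYour previous response was invalid JSON. "
                f"The error was: {err}. "
                "Try again, responding with valid JSON only."
            )
    return None  # the caller can route this input to human review
```

Returning None rather than raising keeps the fallback decision with the caller, which can queue the input for human review or record the failure.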