The /v2/extract endpoint extracts structured data from PDF documents. It supports optional bounding box citations and handles large documents efficiently.
The endpoint returns a job ID for asynchronous processing. Use the job management endpoints to check status and retrieve results.
JSON schema defining the structure and fields to extract from the document. Must be valid JSON Schema with type definitions, properties, and required fields.
Model tier to use for extraction. Available tiers: alpha (fast), beta (balanced), gamma (thorough), delta (advanced). Recommended: gamma (default).
Return bounding box coordinates for extracted values. When enabled, each extracted field includes a citation object with precise location data in the source document.
Run a pre-flight PII check before extraction. If PII is found at or above pii_block_severity, no extraction job is created and the endpoint returns HTTP 200 with "status": "pii_blocked" and "blocked": true in the body, so check those fields before polling.
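The parameters described above could be assembled into request fields roughly as follows. This is a minimal sketch, not the official client: the helper name build_extract_fields and the model_tier field name are assumptions (the tier parameter's actual name is not given here), and the PDF upload itself is left out.

```python
import json

# Tier names as listed in the parameter description above.
ALLOWED_TIERS = ("alpha", "beta", "gamma", "delta")


def build_extract_fields(schema, tier="gamma", enable_citations=False,
                         detect_pii=False, pii_block_severity=None):
    """Assemble the non-file fields for a /v2/extract request.

    Hypothetical helper: "model_tier" is an assumed field name, and the
    document upload is not handled here.
    """
    if tier not in ALLOWED_TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    fields = {
        "schema_data": json.dumps(schema),  # schema is sent as a JSON string
        "model_tier": tier,                 # assumed field name
        "enable_citations": enable_citations,
        "detect_pii": detect_pii,
    }
    # Only send a severity threshold when the pre-flight check is requested.
    if detect_pii and pii_block_severity is not None:
        fields["pii_block_severity"] = pii_block_severity
    return fields
```

Rejecting unknown tiers locally avoids a round trip for a typo; whether the endpoint itself rejects them is not stated here.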
When detect_pii is enabled and PII at or above the threshold is found, the endpoint returns HTTP 200 with no job created. Check status and blocked to tell this apart from a normal submission. The reason is "pii_detected" when pii_block_severity is any, or "pii_detected:<severity>" for the other thresholds.
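A client might separate a blocked submission from a normal one before polling along these lines. This is a sketch under assumptions: the reason is taken to sit under a "reason" key in the response body, and the function names are illustrative only.

```python
def is_pii_blocked(body):
    """True when the submission was stopped by the pre-flight PII check."""
    return body.get("status") == "pii_blocked" and body.get("blocked") is True


def severity_from_reason(reason):
    """Return the severity suffix of a reason string.

    "pii_detected" (threshold "any") yields None;
    "pii_detected:<severity>" yields the text after the colon.
    """
    prefix = "pii_detected"
    if reason == prefix:
        return None
    if reason.startswith(prefix + ":"):
        return reason[len(prefix) + 1:]
    raise ValueError(f"unexpected reason: {reason!r}")
```

Because the blocked case also returns HTTP 200, the status code alone cannot be used to decide whether there is a job to poll.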
{
  "type": "object",
  "properties": {
    "Individuals": {
      "type": "string",
      "description": "Percentage Holding"
    },
    "LIC of India": {
      "type": "string",
      "description": "No of Shares Held"
    },
    "United bank of india": {
      "type": "string",
      "description": "No of shares held by United bank of india"
    }
  },
  "required": [
    "Individuals",
    "LIC of India",
    "United bank of india"
  ],
  "additionalProperties": false
}
The enable_citations parameter controls whether bounding box coordinates are returned with extracted data. Citations provide references back to the source document, allowing you to trace where each extracted value was found.
When enable_citations is set to True, each extracted field includes a citation object with precise location data (or null when the value could not be grounded).
When enable_citations is False (default), no grounding pass runs. On the default gamma tier each field still uses the nested shape, with grounding_score: 0.0 and citation: null. On the alpha, beta, and delta tiers the response uses a legacy flat shape instead: {"value": ..., "score": <number>} with no citation key.
Set enable_citations to True when you need to trace extracted values back to their exact location in the document, such as for UI highlighting or audit trails.
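Since the per-field shape differs by tier when citations are off, client code may want to normalize both shapes into one before use. The sketch below assumes the nested shape stores the extracted value under a "value" key, which is not confirmed here; normalize_field is a hypothetical helper name.

```python
def normalize_field(field):
    """Map either per-field shape onto {"value", "score", "citation"}.

    Nested shape (gamma): has grounding_score and citation keys.
    Legacy flat shape (alpha, beta, delta): {"value": ..., "score": ...}.
    Assumption: the nested shape keeps the value under "value".
    """
    if "grounding_score" in field:
        return {
            "value": field.get("value"),
            "score": field["grounding_score"],
            "citation": field.get("citation"),
        }
    # Flat shape has no citation key, so report it as None.
    return {
        "value": field.get("value"),
        "score": field.get("score"),
        "citation": None,
    }
```

Downstream code can then check for a citation of None in one place, whichever tier produced the result.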
The schema_data parameter must be a valid JSON Schema that defines the structure of data to extract. All schemas must follow the JSON Schema specification with proper type definitions, properties, and constraints.
This example shows a more complex schema for extracting detailed shareholding information:
{
  "type": "object",
  "properties": {
    "shares held by Punjab National bank": {
      "type": "string",
      "description": "shares held by Punjab National bank"
    },
    "shares held by IFCI": {
      "type": "string",
      "description": "shares held by IFCI"
    },
    "shareholding pattern": {
      "type": "object",
      "description": "shareholding pattern",
      "properties": {
        "Percentage holding": {
          "type": "array",
          "description": "percentage holding of shareholders in ACRE",
          "items": {
            "type": "string",
            "description": "percentage holding of shareholders in ACRE"
          }
        },
        "Name of shareholders": {
          "type": "array",
          "description": "Names of shareholders in ACRE",
          "items": {
            "type": "string",
            "description": "Names of shareholders in ACRE"
          }
        }
      },
      "required": ["Percentage holding", "Name of shareholders"],
      "additionalProperties": false
    },
    "names of board of directors": {
      "type": "array",
      "description": "list of names of members of board of directors in ACRE",
      "items": {
        "type": "object",
        "properties": {
          "names of board of directors": {
            "type": "string",
            "description": "list of names of members of board of directors in ACRE"
          }
        },
        "required": ["names of board of directors"],
        "additionalProperties": false
      }
    }
  },
  "required": [
    "shares held by Punjab National bank",
    "shares held by IFCI",
    "shareholding pattern",
    "names of board of directors"
  ],
  "additionalProperties": false
}
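Schemas like the ones shown here share a pattern: every node declares a type, and every name in required also appears in properties. A quick local pre-check for that pattern could look like the sketch below. It is not a full JSON Schema validator and does not reflect how the endpoint validates schema_data; check_schema is a hypothetical helper.

```python
def check_schema(schema, path="$"):
    """Return a list of problems found in a schema (empty list if none).

    Checks only two conventions: each node has a "type", and each
    "required" name of an object node exists in its "properties".
    """
    if not isinstance(schema, dict):
        return [f"{path}: schema node must be a JSON object"]
    problems = []
    node_type = schema.get("type")
    if node_type is None:
        problems.append(f"{path}: missing 'type'")
    if node_type == "object":
        props = schema.get("properties", {})
        for name in schema.get("required", []):
            if name not in props:
                problems.append(f"{path}: required field {name!r} not in properties")
        # Recurse into each declared property.
        for name, sub in props.items():
            problems.extend(check_schema(sub, f"{path}.{name}"))
    elif node_type == "array" and "items" in schema:
        # Recurse into the element schema of an array.
        problems.extend(check_schema(schema["items"], f"{path}[]"))
    return problems
```

Running this before submission catches misspelled required names early; anything it does not flag may still be rejected by the service.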
JSON schema defining the structure and fields to extract from the document. Example: {"type":"object","properties":{"invoice_number":{"type":"string","description":"The invoice number"}},"required":["invoice_number"],"additionalProperties":false}