Extraction pulls typed fields out of a document against a JSON schema we define, returning each leaf tagged with its own confidence score. For raw Markdown or just a category label instead, see the Parse quickstart or the Classification quickstart.
By the end, we’ll have a script that submits an invoice PDF and a schema to /v2/extract, waits for the job to finish, and writes the matched fields to disk as a clean JSON object with per-field confidence scores. Grab the full script from the dropdown below if you’d rather skip the walkthrough.
Set UNSILOED_API_KEY in your environment and save the document you want to extract from as document.pdf in the same directory before running.
extract_document.py
import json
import os
import time
import requests

API_KEY = os.environ["UNSILOED_API_KEY"]
BASE_URL = "https://prod.visionapi.unsiloed.ai"

schema = {
    "type": "object",
    "properties": {
        "vendor_name": {
            "type": "string",
            "description": "Name of the company issuing the invoice (the seller)",
        },
        "invoice_number": {
            "type": "string",
            "description": "Unique invoice identifier shown on the document",
        },
        "issue_date": {
            "type": "string",
            "description": "Date the invoice was issued",
        },
        "total_due": {
            "type": "number",
            "description": "Final total amount due in US dollars, including tax",
        },
        "line_items": {
            "type": "array",
            "description": "One row per line item in the invoice table",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string", "description": "Description of the product or service"},
                    "quantity":    {"type": "number", "description": "Quantity of the item ordered"},
                    "unit_price":  {"type": "number", "description": "Price per unit in US dollars"},
                    "subtotal":    {"type": "number", "description": "Line subtotal in US dollars (quantity x unit_price)"},
                },
                "required": ["description", "quantity", "unit_price", "subtotal"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["vendor_name", "invoice_number", "issue_date", "total_due", "line_items"],
    "additionalProperties": False,
}

with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/v2/extract",
        headers={"api-key": API_KEY},
        files={"pdf_file": ("document.pdf", f, "application/pdf")},
        data={"schema_data": json.dumps(schema)},
    )
response.raise_for_status()

job_id = response.json()["job_id"]
print(f"Job submitted: {job_id}")

max_attempts = 60  # roughly 5 minutes at 5 seconds per poll
attempts = 0
while True:
    poll = requests.get(
        f"{BASE_URL}/extract/{job_id}",
        headers={"api-key": API_KEY},
    )
    poll.raise_for_status()  # surface HTTP errors such as a 404 for an unknown job_id
    result = poll.json()
    print(f"Status: {result['status']}")
    if result["status"] == "completed":
        break
    if result["status"] == "failed":
        raise RuntimeError(result.get("error", "extract job failed"))
    attempts += 1
    if attempts >= max_attempts:
        raise TimeoutError("Extract job did not finish within 5 minutes")
    time.sleep(5)

with open("result.json", "w") as f:
    json.dump(result, f, indent=2)

print("Saved extracted fields to result.json")

Step 1: Set Up Your Environment

Before writing any code, we need three things: an API key, a document, and the runtime for our chosen language.

1.1 Get an Unsiloed AI API Key

To get API access, sign up on Unsiloed AI. Export your key as an environment variable named UNSILOED_API_KEY so it stays out of source control:
export UNSILOED_API_KEY="your-api-key"

1.2 Pick a Document to Extract Fields From

The /v2/extract endpoint supports PDF, DOCX, PPTX, JPG, PNG, and other formats. The walkthrough below assumes a PDF saved as document.pdf in your working directory. To use a different format, update the filename in the snippets to match your file. If you don’t have a document handy, download our sample invoice PDF (a one-page invoice from Northwind Office Supplies with five line items) and save it as document.pdf. The schema in this guide targets the vendor, invoice number, issue date, total, and the line item table on that invoice.

1.3 Install Dependencies

You need Python 3.8 or newer. Install the requests package:
pip install requests

Step 2: Submit a Document With a Schema

The request bundles two fields: pdf_file for the document and schema_data for the JSON schema as a string. The schema is the interesting half. Anything we describe there, from a single total to a nested array of line items, comes back typed and scored in the exact shape we asked for. The endpoint returns a job_id we can poll. All requests go to https://prod.visionapi.unsiloed.ai with the API key in the api-key header.

2.1 Set Up the Script

Create a file called extract_document.py and start with the imports and configuration:
extract_document.py
import json
import os
import time
import requests

API_KEY = os.environ["UNSILOED_API_KEY"]
BASE_URL = "https://prod.visionapi.unsiloed.ai"
API_KEY reads your key from the environment so it doesn’t get hard-coded into the file, and BASE_URL points at the Unsiloed AI production endpoint. Both appear in every request below.

2.2 Define the Schema

The schema tells the API which fields to pull out and what shape they should take. The clearer the description on each field, the better the model locates and types each value. For our sample invoice, we want the vendor name, the invoice number, the issue date, the total due, and the five rows of the line item table.
Continue the file by defining the schema as a Python dict:
extract_document.py
schema = {
    "type": "object",
    "properties": {
        "vendor_name": {
            "type": "string",
            "description": "Name of the company issuing the invoice (the seller)",
        },
        "invoice_number": {
            "type": "string",
            "description": "Unique invoice identifier shown on the document",
        },
        "issue_date": {
            "type": "string",
            "description": "Date the invoice was issued",
        },
        "total_due": {
            "type": "number",
            "description": "Final total amount due in US dollars, including tax",
        },
        "line_items": {
            "type": "array",
            "description": "One row per line item in the invoice table",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string", "description": "Description of the product or service"},
                    "quantity":    {"type": "number", "description": "Quantity of the item ordered"},
                    "unit_price":  {"type": "number", "description": "Price per unit in US dollars"},
                    "subtotal":    {"type": "number", "description": "Line subtotal in US dollars (quantity x unit_price)"},
                },
                "required": ["description", "quantity", "unit_price", "subtotal"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["vendor_name", "invoice_number", "issue_date", "total_due", "line_items"],
    "additionalProperties": False,
}
The schema is plain JSON Schema with strict-mode rules: additionalProperties: false at every object level (which prevents the model from inventing fields we didn’t ask for), and a required list naming the must-have fields. See the Schemas reference for the full ruleset.
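
Because a schema that breaks these rules is easier to catch before upload than after, here is a minimal sketch of a local pre-flight check. It covers only the two rules named above, and the check that every required name is a declared property is our own sanity check rather than a documented rule. The function name find_strict_mode_problems is hypothetical and is not part of the Unsiloed AI API; the Schemas reference remains the authority on the full ruleset.

```python
from typing import Any, Dict, List


def find_strict_mode_problems(schema: Dict[str, Any], path: str = "$") -> List[str]:
    """Return human-readable problems; an empty list means none were found.

    Hypothetical helper: checks only additionalProperties and required,
    not the full ruleset from the Schemas reference.
    """
    problems: List[str] = []
    schema_type = schema.get("type")

    if schema_type == "object":
        properties = schema.get("properties", {})
        # Every object level must explicitly forbid extra properties.
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        # Assumption: a name in "required" should also be declared in "properties".
        for name in schema.get("required", []):
            if name not in properties:
                problems.append(f"{path}: required field '{name}' is not in properties")
        # Recurse into nested fields.
        for name, subschema in properties.items():
            problems.extend(find_strict_mode_problems(subschema, f"{path}.{name}"))
    elif schema_type == "array":
        items = schema.get("items")
        if isinstance(items, dict):
            problems.extend(find_strict_mode_problems(items, f"{path}[]"))

    return problems


# A deliberately broken schema: the row object omits additionalProperties.
broken = {
    "type": "object",
    "properties": {
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"quantity": {"type": "number"}},
                "required": ["quantity"],
            },
        },
    },
    "required": ["line_items"],
    "additionalProperties": False,
}

for problem in find_strict_mode_problems(broken):
    print(problem)
```

Calling find_strict_mode_problems(schema) on the invoice schema above should return an empty list, since both object levels set additionalProperties to False and every required name is declared.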

2.3 Upload the Document

Send the file and the schema as a multipart upload to /v2/extract. The endpoint expects the document under the form field name pdf_file and the schema under schema_data (as a JSON string).
Next, upload the document and the schema together:
extract_document.py
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/v2/extract",
        headers={"api-key": API_KEY},
        files={"pdf_file": ("document.pdf", f, "application/pdf")},
        data={"schema_data": json.dumps(schema)},
    )
response.raise_for_status()
The raise_for_status() call raises an HTTPError on any 4xx or 5xx response, so we don’t need to check .status_code ourselves. The json.dumps(schema) call serializes the dict because the endpoint expects schema_data as a string, not a nested form field.

2.4 Capture the Job ID

Then read and print the job_id:
extract_document.py
job_id = response.json()["job_id"]
print(f"Job submitted: {job_id}")
Run the script:
python extract_document.py
The output should be a single line like Job submitted: a90e48c4-f564-435e-9bf2-ab6eb5a0376d.

Step 3: Poll for Results

The job runs asynchronously. We GET /extract/{job_id} repeatedly until the status is completed, then save the extracted fields to disk. A status of completed means the result is ready; failed means the job errored; any other value (queued, processing, and so on) means the job is still running.

3.1 Write the Polling Loop

Next, drop in a polling loop. The max_attempts cap stops the loop if the job hangs:
extract_document.py
max_attempts = 60  # roughly 5 minutes at 5 seconds per poll
attempts = 0
while True:
    poll = requests.get(
        f"{BASE_URL}/extract/{job_id}",
        headers={"api-key": API_KEY},
    )
    poll.raise_for_status()  # surface HTTP errors such as a 404 for an unknown job_id
    result = poll.json()
    print(f"Status: {result['status']}")
    if result["status"] == "completed":
        break
    if result["status"] == "failed":
        raise RuntimeError(result.get("error", "extract job failed"))
    attempts += 1
    if attempts >= max_attempts:
        raise TimeoutError("Extract job did not finish within 5 minutes")
    time.sleep(5)
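
If the polling logic needs to be reused or tested without hitting the network, one option is to pull it into a function that receives the status lookup as a callable. The sketch below is one possible arrangement, not the official client: wait_for_job, fetch_status, and the sleep parameter are hypothetical names introduced for illustration. The status values and the error field are the ones described above.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(
    fetch_status: Callable[[], Dict[str, Any]],
    max_attempts: int = 60,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status until the job completes, fails, or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = fetch_status()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":
            raise RuntimeError(result.get("error", "extract job failed"))
        # Any other status (queued, processing, ...) means the job is still running.
        if attempt < max_attempts:
            sleep(delay_seconds)
    raise TimeoutError(f"Extract job did not finish after {max_attempts} polls")


# Demo with canned responses instead of real HTTP calls.
canned = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "completed", "result": {}},
])
final = wait_for_job(lambda: next(canned), sleep=lambda seconds: None)
print(final["status"])
```

In the script, fetch_status would wrap the requests.get call from the loop above and return its parsed JSON body. Passing the sleep function as a parameter keeps the demo and any unit tests instant.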

3.2 Save the Extracted Fields

Finally, write the result to disk. The response is already structured JSON, so it goes straight into result.json:
extract_document.py
with open("result.json", "w") as f:
    json.dump(result, f, indent=2)

print("Saved extracted fields to result.json")
Run the script:
python extract_document.py
You should see a few Status: processing lines, then Status: completed, then the summary line. The result.json file appears in the working directory.

Error Responses

Failures fall into two buckets: HTTP errors raised before the job is queued, and a failed status on a job that started but couldn’t complete.

HTTP Errors

The /v2/extract endpoint returns JSON bodies on HTTP errors, with a single detail field describing the problem. The common cases:
  • 401 Unauthorized: body is {"detail": "Invalid API key"}. The api-key header is missing or wrong.
  • 400 Bad Request (missing file): body is {"detail": "Either pdf_file or file_url must be provided"}. The pdf_file form field is missing.
  • 400 Bad Request (bad JSON): body is {"detail": "schema_data must be valid JSON"}. The schema_data field isn’t parseable JSON.
  • 400 Bad Request (unreadable PDF): body starts with {"detail": "Error extracting data with citations: Failed to get PDF page count..."}. The uploaded file isn’t a valid PDF.
  • 422 Unprocessable Entity: body lists the missing or malformed form fields. Usually returned when schema_data is absent.
  • 404 Not Found: body is {"detail": "Job <id> not found"}. The job_id you polled doesn’t exist.
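
Since these bodies share a detail field, a small helper can turn any error body into a printable message. This is a sketch under assumptions: error_detail is a hypothetical name, and the non-string branch guesses that a 422 body may carry a list under detail, which the text above does not spell out. It also falls back to the raw text if the body is not JSON at all, for example an HTML page from a proxy.

```python
import json


def error_detail(body_text: str) -> str:
    """Pull a readable message out of an HTTP error body (hypothetical helper)."""
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        # Not JSON: return the raw text rather than hiding it.
        return body_text.strip() or "empty response body"
    if isinstance(body, dict) and "detail" in body:
        detail = body["detail"]
        # Assumption: detail may be a string or a structured list of field errors.
        return detail if isinstance(detail, str) else json.dumps(detail)
    return json.dumps(body)


print(error_detail('{"detail": "Invalid API key"}'))
print(error_detail('{"detail": "schema_data must be valid JSON"}'))
```

With requests, the body is available on the exception itself: catching requests.HTTPError as exc around raise_for_status() and passing exc.response.text to error_detail gives a one-line message to log.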

Failed Jobs

A job that was accepted but couldn’t be processed comes back with status: "failed" on a subsequent poll. The response shape mirrors a completed one, with an error field describing what went wrong:
{
  "job_id": "7b31a7d7-e810-4a0b-931e-fbed0879bab2",
  "status": "failed",
  "file_name": "document.pdf",
  "error": "Failed to extract structured data from document"
}

Response Shape

A completed response contains job metadata plus a result object with one entry per top-level field in your schema. Each entry has the extracted value and a score between 0 and 1. For arrays of objects, the array itself has a score, and every property inside each row carries its own score as well.
{
  "job_id": "cec5dcb5-53c6-47d5-afe7-28b2182171fb",
  "status": "completed",
  "file_name": "document.pdf",
  "file_url": "https://example-bucket.s3.amazonaws.com/...",
  "created_at": "2026-05-27T08:36:58.336535+00:00",
  "updated_at": "2026-05-27T08:37:19.436604+00:00",
  "metadata": {
    "page_count": 1,
    "order": ["vendor_name", "invoice_number", "issue_date", "total_due", "line_items"],
    "schema": { "...": "..." }
  },
  "result": {
    "vendor_name":    { "value": "Northwind Office Supplies", "score": 0.96 },
    "invoice_number": { "value": "INV-2026-00487",            "score": 0.97 },
    "issue_date":     { "value": "April 14, 2026",            "score": 0.98 },
    "total_due":      { "value": 3705.1,                      "score": 0.96 },
    "line_items": {
      "score": 0.98,
      "value": [
        {
          "description": { "value": "Ergonomic Mesh Office Chair", "score": 0.95, "citation": null },
          "quantity":    { "value": 4,                              "score": 0.96, "citation": null },
          "unit_price":  { "value": 289.0,                          "score": 0.97, "citation": null },
          "subtotal":    { "value": 1156.0,                         "score": 0.98, "citation": null }
        },
        "...four more rows..."
      ]
    }
  }
}
The fields you’ll actually use depend on what you’re building. They fall into three broad categories:
For typed values and validation:
  • result.{field_name}.value: the extracted data, typed to match your schema (string, number, boolean, object, or array)
  • result.{field_name}.score: confidence score between 0 and 1, higher is better. Use it to flag uncertain values for human review.
  • result.{array_field}.value[].{property}.citation: reserved slot for source citations on array rows; null for now
For schema and ordering:
  • metadata.schema: an echo of the schema you submitted, useful for round-tripping or auditing
  • metadata.order: the original order of top-level fields in your schema, since JSON objects don’t preserve insertion order across all clients
  • metadata.page_count: number of pages in the uploaded document
For job and audit tracking:
  • job_id: unique identifier for the extraction job
  • status: completed, failed, or an in-progress value (queued, processing)
  • file_name: name of the uploaded file
  • file_url: temporary signed S3 URL to the uploaded file
  • created_at, updated_at: ISO 8601 timestamps for submission and the most recent status change
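
To make the value and score wrappers concrete, here is a minimal sketch that reads a completed response into plain Python values and lists the top-level fields that fall below a review threshold. The helper names (unwrap, plain_fields, low_confidence) and the 0.9 threshold are assumptions for illustration, not part of the API. The sketch also assumes no schema field is itself an object with both a value and a score property, since it uses that pair to recognize a wrapper.

```python
from typing import Any, Dict, List


def unwrap(node: Any) -> Any:
    """Strip {"value": ..., "score": ...} wrappers, leaving plain values."""
    if isinstance(node, dict) and "value" in node and "score" in node:
        return unwrap(node["value"])
    if isinstance(node, dict):
        return {key: unwrap(child) for key, child in node.items()}
    if isinstance(node, list):
        return [unwrap(child) for child in node]
    return node


def plain_fields(response: Dict[str, Any]) -> Dict[str, Any]:
    """Return top-level fields as plain values, in metadata.order when present."""
    result = response["result"]
    order = response.get("metadata", {}).get("order") or list(result)
    return {name: unwrap(result[name]) for name in order if name in result}


def low_confidence(response: Dict[str, Any], threshold: float = 0.9) -> List[str]:
    """Names of top-level fields whose score is below the threshold."""
    return [name for name, entry in response["result"].items() if entry["score"] < threshold]


# Trimmed copy of the completed response shown above.
response = {
    "metadata": {"order": ["vendor_name", "total_due", "line_items"]},
    "result": {
        "vendor_name": {"value": "Northwind Office Supplies", "score": 0.96},
        "total_due": {"value": 3705.1, "score": 0.96},
        "line_items": {
            "score": 0.98,
            "value": [
                {
                    "description": {"value": "Ergonomic Mesh Office Chair", "score": 0.95, "citation": None},
                    "quantity": {"value": 4, "score": 0.96, "citation": None},
                }
            ],
        },
    },
}

print(plain_fields(response))
print(low_confidence(response, threshold=0.97))
```

The right threshold depends on how costly a wrong value is downstream, so treat 0.9 as a placeholder to tune.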

Sample Output

Running the script against the sample invoice writes the JSON above to result.json. Every field comes back with its own confidence score, so flagging uncertain values becomes a per-field check rather than a re-read of the document. The fields extracted from the sample:
| Field | Extracted value | Confidence |
| --- | --- | --- |
| vendor_name | Northwind Office Supplies | 96% |
| invoice_number | INV-2026-00487 | 97% |
| issue_date | April 14, 2026 | 98% |
| total_due | 3705.10 | 96% |
| line_items | 5 rows | 98% |
And the five rows of line_items, each a structured object in its own right:
| Description | Quantity | Unit Price | Subtotal |
| --- | --- | --- | --- |
| Ergonomic Mesh Office Chair | 4 | 289.00 | 1,156.00 |
| Adjustable Standing Desk (60” x 30”) | 2 | 549.00 | 1,098.00 |
| LED Desk Lamp with USB Charging | 6 | 42.50 | 255.00 |
| Acoustic Panel, 24” Hexagon (Pack of 4) | 3 | 78.00 | 234.00 |
| Wireless Mechanical Keyboard | 5 | 129.99 | 649.95 |
Every cell in the table has its own score in the underlying JSON, so downstream code can flag individual uncertain values without rejecting the whole row.
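
A per-cell check of that kind might look like the sketch below. It flags individual cells under a score threshold and, as a second independent signal, rows where quantity times unit_price does not match subtotal. The function names, the 0.9 threshold, and the half-cent tolerance are all assumptions. The first demo row is the one shown in the response above; the second is a made-up row added only to give the checks something to catch.

```python
from typing import Any, Dict, List, Tuple


def flag_cells(line_items: Dict[str, Any], threshold: float = 0.9) -> List[Tuple[int, str, Any, float]]:
    """Return (row_index, column, value, score) for each cell below the threshold."""
    flagged = []
    for row_index, row in enumerate(line_items["value"]):
        for column, cell in row.items():
            if cell["score"] < threshold:
                flagged.append((row_index, column, cell["value"], cell["score"]))
    return flagged


def subtotal_mismatches(line_items: Dict[str, Any], tolerance: float = 0.005) -> List[int]:
    """Return indexes of rows where quantity * unit_price differs from subtotal."""
    mismatched = []
    for row_index, row in enumerate(line_items["value"]):
        expected = row["quantity"]["value"] * row["unit_price"]["value"]
        if abs(expected - row["subtotal"]["value"]) > tolerance:
            mismatched.append(row_index)
    return mismatched


line_items = {
    "score": 0.98,
    "value": [
        {   # Row taken from the sample response.
            "description": {"value": "Ergonomic Mesh Office Chair", "score": 0.95, "citation": None},
            "quantity": {"value": 4, "score": 0.96, "citation": None},
            "unit_price": {"value": 289.0, "score": 0.97, "citation": None},
            "subtotal": {"value": 1156.0, "score": 0.98, "citation": None},
        },
        {   # Hypothetical row, invented for this example only.
            "description": {"value": "Example Item", "score": 0.99, "citation": None},
            "quantity": {"value": 2, "score": 0.62, "citation": None},
            "unit_price": {"value": 10.0, "score": 0.97, "citation": None},
            "subtotal": {"value": 25.0, "score": 0.98, "citation": None},
        },
    ],
}

print(flag_cells(line_items))
print(subtotal_mismatches(line_items))
```

The arithmetic check is useful precisely because it does not depend on the scores: a misread digit can still come back with high confidence.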

Next Steps

For more on extraction, including schema rules, supported types, and the full response reference, see the Extract overview.

Schemas

JSON Schema rules, supported types, and worked examples for invoices and SEC filings.

Response Format

The canonical extraction response with a field-by-field reference.

API Reference

Browse the full request and response specs for /v2/extract.

FAQ

Check limits, supported formats, and answers to common questions.