Mail rooms, AP scans, and patient intake forms often arrive as one big PDF with several different documents stacked together. The /splitter endpoint takes that bundle and a list of categories, then returns one labeled PDF per matched category. For other endpoints, see the Parse quickstart, Extraction quickstart, or Classification quickstart.
This walkthrough builds a script that sends a bundled PDF and a category list to /splitter, waits for the split to complete, and writes one labeled PDF per matched category into a local split_files/ directory, ready to drop into per-category downstream pipelines. If you’d rather just copy the whole script, it appears in full below.
Set UNSILOED_API_KEY in your environment and save the bundled PDF as bundle.pdf in the same directory before running.
split_bundle.py
import json
import os
import time
import requests

API_KEY = os.environ["UNSILOED_API_KEY"]
BASE_URL = "https://prod.visionapi.unsiloed.ai"

categories = [
    {"name": "Invoice", "description": "Vendor invoices with itemized charges and a total due"},
    {"name": "Receipt", "description": "Point-of-sale receipts with line items, tax, and payment method"},
    {"name": "Purchase Order", "description": "Buyer-issued purchase orders authorizing goods or services from a vendor"},
]

with open("bundle.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/splitter",
        headers={"api-key": API_KEY},
        files={"file": ("bundle.pdf", f, "application/pdf")},
        data={"categories": json.dumps(categories)},
    )
response.raise_for_status()

job_id = response.json()["job_id"]
print(f"Job submitted: {job_id}")

max_attempts = 60  # roughly 5 minutes at 5 seconds per poll
attempts = 0
while True:
    poll = requests.get(
        f"{BASE_URL}/splitter/{job_id}",
        headers={"api-key": API_KEY},
    )
    poll.raise_for_status()  # surface HTTP errors such as 404 "Job not found"
    result = poll.json()
    print(f"Status: {result['status']}")
    if result["status"] == "completed":
        break
    if result["status"] == "failed":
        raise RuntimeError(result.get("error", "split job failed"))
    attempts += 1
    if attempts >= max_attempts:
        raise TimeoutError("Split job did not finish within 5 minutes")
    time.sleep(5)

with open("result.json", "w") as f:
    json.dump(result, f, indent=2)

os.makedirs("split_files", exist_ok=True)
for file_info in result["result"]["files"]:
    download = requests.get(file_info["full_path"])
    download.raise_for_status()  # don't save an error response to disk as a PDF
    pdf_bytes = download.content
    out_path = os.path.join("split_files", file_info["name"])
    with open(out_path, "wb") as out:
        out.write(pdf_bytes)
    print(f"Saved {out_path} (confidence={file_info['confidence_score']:.2%})")

Step 1: Set Up Your Environment

Before writing any code, we need three things: an API key, a bundled PDF, and a Python runtime.

1.1 Get an Unsiloed AI API Key

To get API access, sign up on Unsiloed AI. Export your key as an environment variable named UNSILOED_API_KEY so it stays out of source control:
export UNSILOED_API_KEY="your-api-key"
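If the variable is missing, os.environ["UNSILOED_API_KEY"] fails with a bare KeyError. As a minimal sketch of a friendlier check, the helper below exits with a readable message instead. The name require_api_key is hypothetical and not part of any Unsiloed AI SDK.

```python
import os


def require_api_key(environ=None, name="UNSILOED_API_KEY"):
    """Return the API key from the environment or exit with a readable message.

    Hypothetical helper for illustration; not part of the Unsiloed AI API.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(name, "").strip()
    if not key:
        raise SystemExit(f"{name} is not set; export it before running the script")
    return key


# Demonstration with a stand-in mapping rather than the real environment.
print(require_api_key({"UNSILOED_API_KEY": "your-api-key"}))
```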

1.2 Pick a Bundled PDF

The /splitter endpoint is designed for PDFs that contain more than one logical document. This walkthrough assumes a multi-document PDF saved as bundle.pdf in your working directory. If you don’t have one handy, download our sample bundle (a three-page accounts-payable batch scan: an invoice, a receipt, and a purchase order) and save it as bundle.pdf.

1.3 Install Dependencies

You need Python 3.8 or newer. Install the requests package:
pip install requests

Step 2: Submit the Bundle

Two form fields go up: file for the bundled PDF and categories for a JSON-stringified array of the labels the splitter can choose from. The categories list is the only vocabulary the splitter uses. Pages that don’t fit any category are still grouped under the closest match, so the list needs to cover everything that could plausibly appear in the bundle. The endpoint returns a job_id to poll. All requests go to https://prod.visionapi.unsiloed.ai with the API key in the api-key header.

2.1 Set Up the Script

Create a file called split_bundle.py and start with the imports and configuration:
split_bundle.py
import json
import os
import time
import requests

API_KEY = os.environ["UNSILOED_API_KEY"]
BASE_URL = "https://prod.visionapi.unsiloed.ai"
API_KEY reads your key from the environment so it doesn’t get hard-coded into the file, and BASE_URL points at the Unsiloed AI production endpoint. Both appear in every request below.

2.2 Define the Categories

Decide which document types the bundle might contain. Each category is an object with a name and an optional description; richer descriptions help the splitter pick the right label when categories are similar.
Add the category list to the script:
split_bundle.py
categories = [
    {"name": "Invoice", "description": "Vendor invoices with itemized charges and a total due"},
    {"name": "Receipt", "description": "Point-of-sale receipts with line items, tax, and payment method"},
    {"name": "Purchase Order", "description": "Buyer-issued purchase orders authorizing goods or services from a vendor"},
]
Include every document type the bundle might contain: pages that don’t match any category are still grouped under the closest match, so an unlisted type ends up under the wrong label rather than being dropped.
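Since a malformed list is only rejected once the upload reaches the server, it can be worth checking it locally first. The sketch below shows one way to do that. validate_categories is a hypothetical helper, and apart from the non-empty requirement its rules (non-blank unique names, string descriptions) are assumptions based on the request format described here, not documented server-side validation.

```python
import json


def validate_categories(categories):
    """Check a category list locally and return it JSON-encoded for the form field.

    Hypothetical helper: the rules below are assumptions, not the server's own checks.
    """
    if not categories:
        raise ValueError("at least one category is required")
    seen = set()
    for category in categories:
        name = category.get("name", "")
        if not isinstance(name, str) or not name.strip():
            raise ValueError(f"category is missing a name: {category!r}")
        if name in seen:
            raise ValueError(f"duplicate category name: {name!r}")
        seen.add(name)
        description = category.get("description")
        if description is not None and not isinstance(description, str):
            raise ValueError(f"description for {name!r} must be a string")
    return json.dumps(categories)


categories = [
    {"name": "Invoice", "description": "Vendor invoices with itemized charges and a total due"},
    {"name": "Receipt"},  # description is optional
]
payload = validate_categories(categories)
```

The returned string can be passed as the categories form field in place of json.dumps(categories).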

2.3 Upload the Bundle

Send the file and categories as a multipart upload to /splitter. The endpoint expects the document under the form field name file and the categories as a JSON-encoded string under categories.
Continue the script by uploading the bundle:
split_bundle.py
with open("bundle.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/splitter",
        headers={"api-key": API_KEY},
        files={"file": ("bundle.pdf", f, "application/pdf")},
        data={"categories": json.dumps(categories)},
    )
response.raise_for_status()
The raise_for_status() call raises requests.HTTPError on any 4xx or 5xx response, so we don’t need to check .status_code ourselves.

2.4 Capture the Job ID

Read and print the job_id:
split_bundle.py
job_id = response.json()["job_id"]
print(f"Job submitted: {job_id}")
Run the script:
python split_bundle.py
The output should be a single line like Job submitted: 887f26e6-d089-47f6-8def-afe84de40ecd.

Step 3: Poll and Download the Split Files

The job runs asynchronously. We GET /splitter/{job_id} repeatedly until the status is completed, then download each split PDF using the signed URL in the response. The status values the polling loop handles:
  • completed: the split files are ready to download
  • failed: the job errored; check the error field for details
  • queued: the job is waiting to be picked up
  • processing: the job is still running

3.1 Write the Polling Loop

Add a polling loop. The max_attempts cap stops the loop if the job hangs:
split_bundle.py
max_attempts = 60  # roughly 5 minutes at 5 seconds per poll
attempts = 0
while True:
    poll = requests.get(
        f"{BASE_URL}/splitter/{job_id}",
        headers={"api-key": API_KEY},
    )
    poll.raise_for_status()  # surface HTTP errors such as 404 "Job not found"
    result = poll.json()
    print(f"Status: {result['status']}")
    if result["status"] == "completed":
        break
    if result["status"] == "failed":
        raise RuntimeError(result.get("error", "split job failed"))
    attempts += 1
    if attempts >= max_attempts:
        raise TimeoutError("Split job did not finish within 5 minutes")
    time.sleep(5)
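The same polling logic can be pulled into a function, which makes it reusable and testable without a network. This is a sketch under assumptions: wait_for_job and its fetch_status parameter are hypothetical names, and the canned responses stand in for real GET /splitter/{job_id} calls.

```python
import time


def wait_for_job(fetch_status, max_attempts=60, delay=5.0, sleep=time.sleep):
    """Poll fetch_status() until the job completes, fails, or the cap is reached.

    fetch_status is any zero-argument callable returning the decoded job JSON.
    Hypothetical helper for illustration; not part of the Unsiloed AI API.
    """
    for attempt in range(max_attempts):
        result = fetch_status()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":
            raise RuntimeError(result.get("error") or "split job failed")
        if attempt < max_attempts - 1:
            sleep(delay)  # any other status is treated as still in progress
    raise TimeoutError(f"job did not finish after {max_attempts} polls")


# Offline demonstration with canned responses instead of HTTP calls.
responses = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "completed", "result": {"files": []}},
])
final = wait_for_job(lambda: next(responses), delay=0, sleep=lambda _: None)
print(final["status"])  # prints "completed"
```

In the walkthrough script, fetch_status would wrap the existing request, for example lambda: requests.get(f"{BASE_URL}/splitter/{job_id}", headers={"api-key": API_KEY}).json().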

3.2 Download the Split PDFs

Each entry in result.result.files has a presigned full_path URL that downloads the split PDF. The code below saves the metadata to result.json and writes each split file into a split_files/ directory.
Add the download step:
split_bundle.py
with open("result.json", "w") as f:
    json.dump(result, f, indent=2)

os.makedirs("split_files", exist_ok=True)
for file_info in result["result"]["files"]:
    download = requests.get(file_info["full_path"])
    download.raise_for_status()  # don't save an error response to disk as a PDF
    pdf_bytes = download.content
    out_path = os.path.join("split_files", file_info["name"])
    with open(out_path, "wb") as out:
        out.write(pdf_bytes)
    print(f"Saved {out_path} (confidence={file_info['confidence_score']:.2%})")
Run the script:
python split_bundle.py
You should see a few Status: processing lines, then Status: completed, then one Saved line per matched category.
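Because each output filename comes from the response, a defensive script may want to make sure the path it builds stays inside split_files/. The sketch below is one way to do that. safe_output_path is a hypothetical helper, and the idea that a returned name could contain path separators is an assumption made for illustration, not observed API behavior.

```python
import os


def safe_output_path(directory, name):
    """Build an output path that always stays inside `directory`.

    Only the final path component of `name` is kept, so separators or
    ".." segments in a filename cannot point outside the folder.
    Hypothetical helper for illustration.
    """
    base = os.path.basename(name.replace("\\", "/"))
    if base in ("", ".", ".."):
        raise ValueError(f"unusable file name: {name!r}")
    return os.path.join(directory, base)


# A normal name passes through unchanged; a traversal attempt is reduced
# to its last component.
print(safe_output_path("split_files", "Purchase Order.pdf"))
print(safe_output_path("split_files", "../../etc/passwd"))
```

In the download loop, this would replace os.path.join("split_files", file_info["name"]).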

Error Responses

Failures fall into two buckets: HTTP errors returned when a request is rejected outright, and a failed status on a job that started but couldn’t complete.

HTTP Errors

The /splitter endpoint returns JSON error bodies with a detail field. The common cases:
  • 401 Unauthorized: body is {"detail":"Invalid API key"}. The api-key header is missing or wrong.
  • 400 Bad Request: body is {"detail":"Invalid JSON format for categories: ..."}. The categories form field isn’t valid JSON.
  • 422 Unprocessable Entity: body is {"detail":[{"type":"missing","loc":["body","categories"],"msg":"Field required","input":null}]}. A required form field (usually file or categories) is missing.
  • 400 Bad Request: body is {"detail":"At least one category is required"}. The categories array is empty.
  • 400 Bad Request: body is {"detail":"Failed to process file: Failed to get PDF page count: ..."}. The upload isn’t a readable PDF.
  • 404 Not Found: body is {"detail":"Job not found"}. The job_id you polled doesn’t exist.
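The detail field is a string in most of the cases above but a list of objects for the 422 response, so code that prints it needs to handle both shapes. A minimal sketch, assuming the error bodies look like the examples listed here. describe_error is a hypothetical helper name.

```python
def describe_error(status_code, body):
    """Turn a decoded JSON error body into a one-line message.

    Hypothetical helper; handles the string and list forms of `detail`
    shown in the examples above.
    """
    detail = body.get("detail") if isinstance(body, dict) else None
    if isinstance(detail, str):
        message = detail
    elif isinstance(detail, list):
        parts = []
        for item in detail:
            location = ".".join(str(part) for part in item.get("loc", []))
            parts.append(f"{location}: {item.get('msg', 'invalid')}")
        message = "; ".join(parts)
    else:
        message = "no detail provided"
    return f"HTTP {status_code}: {message}"


print(describe_error(401, {"detail": "Invalid API key"}))
# prints "HTTP 401: Invalid API key"
print(describe_error(422, {"detail": [
    {"type": "missing", "loc": ["body", "categories"], "msg": "Field required", "input": None}
]}))
# prints "HTTP 422: body.categories: Field required"
```

In the walkthrough script, this could be called from an except requests.HTTPError as exc block around raise_for_status(), passing exc.response.status_code and exc.response.json().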

Failed Jobs

A job that was accepted but couldn’t be processed comes back with status: "failed" and a populated error field. The walkthrough’s polling loop raises on this case so you see the message instead of waiting out the timeout:
{
  "job_id": "7b31a7d7-e810-4a0b-931e-fbed0879bab2",
  "status": "failed",
  "progress": "Splitting failed",
  "error": "Failed to classify pages",
  "file_url": "https://example-bucket.s3.amazonaws.com/user_uploads/...",
  "file_name": "bundle.pdf",
  "parameters": { "classes": ["Invoice", "Receipt", "Purchase Order"], "page_count": 3 },
  "result": null
}

Response Shape

A completed response contains job metadata, an echo of the input parameters, and a result.files[] array with one entry per matched category. Each entry carries a presigned full_path URL we can download directly.
{
  "job_id": "887f26e6-d089-47f6-8def-afe84de40ecd",
  "status": "completed",
  "progress": "Starting document splitting...",
  "error": null,
  "file_url": "https://example-bucket.s3.amazonaws.com/user_uploads/...",
  "file_name": "bundle.pdf",
  "parameters": {
    "classes": ["Invoice", "Receipt", "Purchase Order"],
    "page_count": 3,
    "enable_reordering": false,
    "category_descriptions": {
      "Invoice": "Vendor invoices with itemized charges and a total due",
      "Receipt": "Point-of-sale receipts with line items, tax, and payment method",
      "Purchase Order": "Buyer-issued purchase orders authorizing goods or services from a vendor"
    }
  },
  "result": {
    "success": true,
    "message": "Successfully split PDF into 3 files",
    "files": [
      {
        "name": "Invoice.pdf",
        "path": "Invoice.pdf",
        "type": "file",
        "fileId": "359afb3c-2554-4acd-9cb3-be4044d7ec97",
        "full_path": "https://example-bucket.s3.amazonaws.com/files/...",
        "confidence_score": 0.9999976308610644
      },
      {
        "name": "Receipt.pdf",
        "path": "Receipt.pdf",
        "type": "file",
        "fileId": "eca3f2db-e113-4b5f-92db-aab89d417114",
        "full_path": "https://example-bucket.s3.amazonaws.com/files/...",
        "confidence_score": 0.9999976308610644
      },
      {
        "name": "Purchase Order.pdf",
        "path": "Purchase Order.pdf",
        "type": "file",
        "fileId": "1af2c343-9a86-46c5-b8f0-c48cc04d1b6f",
        "full_path": "https://example-bucket.s3.amazonaws.com/files/...",
        "confidence_score": 0.9999976308610644
      }
    ]
  }
}
The fields fall into three broad categories.
For downloading the split PDFs:
  • result.files[].full_path: presigned S3 URL to download the split PDF; this is what the walkthrough fetches into split_files/
  • result.files[].name: filename derived from the matched category, suitable for saving to disk
  • result.files[].confidence_score: the splitter’s confidence in the classification, on a 0-1 scale; use it to flag low-confidence splits for human review
  • result.files[].fileId: unique identifier for the split file, useful for tracking or deduplicating downstream
For echoing the request back:
  • parameters.classes: the category names you submitted
  • parameters.category_descriptions: the descriptions you submitted, keyed by category name
  • parameters.page_count: number of pages in the uploaded PDF
  • parameters.enable_reordering: whether the splitter reordered pages within each category after classification; defaults to false
  • file_url: signed URL to the original uploaded bundle
  • file_name: name of the uploaded bundle
For job and progress tracking:
  • status: completed, failed, or an in-progress value such as processing
  • progress: human-readable progress message
  • error: error message if the job failed, otherwise null
  • result.success: whether the split operation succeeded
  • result.message: human-readable success or failure message
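As an illustration of using confidence_score to route splits for human review, the sketch below partitions result.files entries around a threshold. The function name split_by_confidence and the 0.9 default are assumptions chosen for the example, not recommended values, and the second sample score is invented to show the review path.

```python
def split_by_confidence(files, threshold=0.9):
    """Partition result.files entries into (accepted, needs_review).

    Hypothetical helper; the default threshold is an arbitrary example value.
    """
    accepted, needs_review = [], []
    for file_info in files:
        if file_info["confidence_score"] >= threshold:
            accepted.append(file_info)
        else:
            needs_review.append(file_info)
    return accepted, needs_review


# Illustrative entries; the second score is invented for the example.
files = [
    {"name": "Invoice.pdf", "confidence_score": 0.9999976308610644},
    {"name": "Receipt.pdf", "confidence_score": 0.62},
]
accepted, needs_review = split_by_confidence(files)
print([f["name"] for f in needs_review])  # prints ['Receipt.pdf']
```

With the walkthrough's result, the call would be split_by_confidence(result["result"]["files"]).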

Sample Output

Running the script against the sample AP batch produces a split_files/ directory with one PDF per matched category:
split_files/
├── Invoice.pdf          # page 1 — the Greenfield Print & Bindery invoice
├── Receipt.pdf          # page 2 — the Cooper's Office Supply receipt
└── Purchase Order.pdf   # page 3 — the Lighthouse Studios LLC purchase order
All three files report a confidence score above 99%. Open them to confirm each page of the bundle landed in the matching category file.

Next Steps

For more on splitting, including the underlying classification step and the full response shape, see the Splitting overview and the Response Format reference.

Splitting Overview

Learn how the splitter groups pages and where to use it in a pipeline.

Classification

Classify a single document against candidate categories instead of splitting a bundle.

API Reference

Browse the full request and response specs for the splitting endpoint.

FAQ

Check limits, supported formats, and answers to common questions.