Getting Started With Splitting

Mail rooms, AP scans, and patient intake forms often arrive as one big PDF with several different documents stacked together. The /splitter endpoint takes that bundle and a list of categories, then returns one labeled PDF per matched category. For other endpoints, see the Parse quickstart, Extraction quickstart, or Classification quickstart.

This walkthrough builds a script that ships a bundled PDF and a category list to /splitter, waits for the split to complete, and writes one labeled PDF per matched category into a local split_files/ directory, ready to drop into per-category downstream pipelines. If you’d rather just copy the whole script, it’s in the dropdown below.

Show the Full Script

Set UNSILOED_API_KEY in your environment and save the bundled PDF as bundle.pdf in the same directory before running.

Python
JavaScript
cURL

split_bundle.py

import json
import os
import time
import requests

API_KEY = os.environ["UNSILOED_API_KEY"]
BASE_URL = "https://prod.visionapi.unsiloed.ai"

categories = [
    {"name": "Invoice", "description": "Vendor invoices with itemized charges and a total due"},
    {"name": "Receipt", "description": "Point-of-sale receipts with line items, tax, and payment method"},
    {"name": "Purchase Order", "description": "Buyer-issued purchase orders authorizing goods or services from a vendor"},
]

with open("bundle.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/splitter",
        headers={"api-key": API_KEY},
        files={"file": ("bundle.pdf", f, "application/pdf")},
        data={"categories": json.dumps(categories)},
    )
response.raise_for_status()

job_id = response.json()["job_id"]
print(f"Job submitted: {job_id}")

max_attempts = 60  # roughly 5 minutes at 5 seconds per poll
attempts = 0
while True:
    poll = requests.get(
        f"{BASE_URL}/splitter/{job_id}",
        headers={"api-key": API_KEY},
    )
    poll.raise_for_status()  # surface 401/404 instead of a KeyError on "status"
    result = poll.json()
    print(f"Status: {result['status']}")
    if result["status"] == "completed":
        break
    if result["status"] == "failed":
        raise RuntimeError(result.get("error", "split job failed"))
    attempts += 1
    if attempts >= max_attempts:
        raise TimeoutError("Split job did not finish within 5 minutes")
    time.sleep(5)

with open("result.json", "w") as f:
    json.dump(result, f, indent=2)

os.makedirs("split_files", exist_ok=True)
for file_info in result["result"]["files"]:
    download = requests.get(file_info["full_path"])
    download.raise_for_status()  # don't save an error page as a PDF
    pdf_bytes = download.content
    out_path = os.path.join("split_files", file_info["name"])
    with open(out_path, "wb") as out:
        out.write(pdf_bytes)
    print(f"Saved {out_path} (confidence={file_info['confidence_score']:.2%})")

Save this as script.mjs, or keep a .js extension and set "type": "module" in your package.json; the script uses ES module imports and top-level await. Requires Node.js 18 or newer for the global fetch, FormData, and Blob.

script.mjs

import fs from "node:fs";
import path from "node:path";

const API_KEY = process.env.UNSILOED_API_KEY;
const BASE_URL = "https://prod.visionapi.unsiloed.ai";

const categories = [
  { name: "Invoice", description: "Vendor invoices with itemized charges and a total due" },
  { name: "Receipt", description: "Point-of-sale receipts with line items, tax, and payment method" },
  { name: "Purchase Order", description: "Buyer-issued purchase orders authorizing goods or services from a vendor" },
];

const form = new FormData();
form.append("file", new Blob([fs.readFileSync("bundle.pdf")]), "bundle.pdf");
form.append("categories", JSON.stringify(categories));

const response = await fetch(`${BASE_URL}/splitter`, {
  method: "POST",
  headers: { "api-key": API_KEY },
  body: form,
});
if (!response.ok) throw new Error(`${response.status}: ${await response.text()}`);

const { job_id } = await response.json();
console.log(`Job submitted: ${job_id}`);

const maxAttempts = 60; // roughly 5 minutes at 5 seconds per poll
let attempts = 0;
let result;
while (true) {
  const res = await fetch(`${BASE_URL}/splitter/${job_id}`, {
    headers: { "api-key": API_KEY },
  });
  if (!res.ok) throw new Error(`${res.status}: ${await res.text()}`);
  result = await res.json();
  console.log(`Status: ${result.status}`);
  if (result.status === "completed") break;
  if (result.status === "failed") throw new Error(result.error || "split job failed");
  if (++attempts >= maxAttempts) throw new Error("Split job did not finish within 5 minutes");
  await new Promise((r) => setTimeout(r, 5000));
}

fs.writeFileSync("result.json", JSON.stringify(result, null, 2));

fs.mkdirSync("split_files", { recursive: true });
for (const file of result.result.files) {
  const res = await fetch(file.full_path);
  if (!res.ok) throw new Error(`Download failed for ${file.name}: ${res.status}`);
  const buf = Buffer.from(await res.arrayBuffer());
  const outPath = path.join("split_files", file.name);
  fs.writeFileSync(outPath, buf);
  console.log(`Saved ${outPath} (confidence=${(file.confidence_score * 100).toFixed(2)}%)`);
}

# Submit the bundle and capture the job_id from the response:
resp=$(curl -sX POST "https://prod.visionapi.unsiloed.ai/splitter" \
  -H "api-key: $UNSILOED_API_KEY" \
  -F "file=@bundle.pdf" \
  -F 'categories=[{"name":"Invoice","description":"Vendor invoices with itemized charges and a total due"},{"name":"Receipt","description":"Point-of-sale receipts with line items, tax, and payment method"},{"name":"Purchase Order","description":"Buyer-issued purchase orders authorizing goods or services from a vendor"}]')
JOB_ID=$(echo "$resp" | grep -o '"job_id":"[^"]*"' | cut -d'"' -f4)
echo "Job submitted: $JOB_ID"

# Poll until the job finishes, with a 5-minute timeout:
attempts=0
max_attempts=60
while true; do
  resp=$(curl -sX GET "https://prod.visionapi.unsiloed.ai/splitter/$JOB_ID" \
    -H "api-key: $UNSILOED_API_KEY")
  status=$(echo "$resp" | grep -o '"status":"[^"]*"' | head -1 | cut -d'"' -f4)
  echo "Status: $status"
  [ "$status" = "completed" ] && break
  [ "$status" = "failed" ] && { echo "Job failed"; exit 1; }
  attempts=$((attempts + 1))
  [ "$attempts" -ge "$max_attempts" ] && { echo "Split job did not finish within 5 minutes"; exit 1; }
  sleep 5
done

# Save the full response and download each split file:
echo "$resp" > result.json
mkdir -p split_files
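# Print the URL first: file names such as "Purchase Order.pdf" contain spaces,
# and read assigns the remainder of the line to the last variable.
echo "$resp" \
  | python3 -c "import json,sys; [print(f['full_path'], f['name']) for f in json.load(sys.stdin)['result']['files']]" \
  | while read -r url name; do
      curl -s "$url" -o "split_files/$name"
      echo "Saved split_files/$name"
    done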

Step 1: Set Up Your Environment

Before writing any code, we need three things: an API key, a bundled PDF, and the runtime for our chosen language.

1.1 Get an Unsiloed AI API Key

To get API access, sign up on Unsiloed AI. Export your key as an environment variable named UNSILOED_API_KEY so it stays out of source control:

export UNSILOED_API_KEY="your-api-key"
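With the variable exported, a script can fail fast with a readable message when the key is missing, rather than stopping on a bare KeyError. The following is a minimal sketch; load_api_key is a hypothetical helper name used for illustration and is not part of any Unsiloed AI SDK.

```python
import os


def load_api_key(env=None):
    """Return the API key from the environment, or raise a readable error.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("UNSILOED_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "UNSILOED_API_KEY is not set; export it before running the script"
        )
    return key
```

Calling load_api_key() once at the top of the script keeps the check in one place.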

1.2 Pick a Bundled PDF

The /splitter endpoint is designed for PDFs that contain more than one logical document. This walkthrough assumes a multi-document PDF saved as bundle.pdf in your working directory. If you don’t have one handy, download our sample bundle (a three-page accounts-payable batch scan: an invoice, a receipt, and a purchase order) and save it as bundle.pdf.

1.3 Install Dependencies

Python
JavaScript
cURL

You need Python 3.8 or newer. Install the requests package:

pip install requests

You need Node.js 18 or newer for the global fetch, FormData, and Blob. No external packages needed.

Step 2: Submit the Bundle

Two form fields go up: file for the bundled PDF and categories for a JSON-stringified array of the labels the splitter can choose from. The categories list is the only vocabulary the splitter uses. Pages that don’t fit any category are still grouped under the closest match, so the list needs to cover everything that could plausibly appear in the bundle. The endpoint returns a job_id to poll. All requests go to https://prod.visionapi.unsiloed.ai with the API key in the api-key header.

2.1 Set Up the Script

Python
JavaScript
cURL

Create a file called split_bundle.py and start with the imports and configuration:

split_bundle.py

import json
import os
import time
import requests

API_KEY = os.environ["UNSILOED_API_KEY"]
BASE_URL = "https://prod.visionapi.unsiloed.ai"

API_KEY reads your key from the environment so it doesn’t get hard-coded into the file, and BASE_URL points at the Unsiloed AI production endpoint. Both appear in every request below.

Create a file called script.mjs and start with the imports and configuration:

script.mjs

import fs from "node:fs";
import path from "node:path";

const API_KEY = process.env.UNSILOED_API_KEY;
const BASE_URL = "https://prod.visionapi.unsiloed.ai";

API_KEY reads your key from the environment so it doesn’t get hard-coded into the file, and BASE_URL points at the Unsiloed AI production endpoint. Both appear in every request below.

2.2 Define the Categories

Decide which document types the bundle might contain. Each category is an object with a name and an optional description; richer descriptions help the splitter pick the right label when categories are similar.

Python
JavaScript
cURL

Add the category list to the script:

split_bundle.py

categories = [
    {"name": "Invoice", "description": "Vendor invoices with itemized charges and a total due"},
    {"name": "Receipt", "description": "Point-of-sale receipts with line items, tax, and payment method"},
    {"name": "Purchase Order", "description": "Buyer-issued purchase orders authorizing goods or services from a vendor"},
]

Add the category list to the script:

script.mjs

const categories = [
  { name: "Invoice", description: "Vendor invoices with itemized charges and a total due" },
  { name: "Receipt", description: "Point-of-sale receipts with line items, tax, and payment method" },
  { name: "Purchase Order", description: "Buyer-issued purchase orders authorizing goods or services from a vendor" },
];

Include every document type the bundle might contain. Pages that don’t match any category are still grouped under the closest match.
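Because a malformed or empty category list is rejected by the endpoint, it can help to validate the list locally before uploading. The sketch below is one possible check, not the service's own validation: build_categories_field is a hypothetical helper, and the duplicate-name rule is an assumption added for safety rather than a documented API requirement.

```python
import json


def build_categories_field(categories):
    """Validate a category list and return it as a JSON string for the form field."""
    if not categories:
        raise ValueError("at least one category is required")
    seen = set()
    for category in categories:
        name = str(category.get("name", "")).strip()
        if not name:
            raise ValueError("every category needs a non-empty name")
        # Assumption: names that differ only by case are treated as duplicates.
        if name.lower() in seen:
            raise ValueError(f"duplicate category name: {name}")
        seen.add(name.lower())
    return json.dumps(categories)
```

The returned string can be passed as the categories form value in place of json.dumps(categories).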

2.3 Upload the Bundle

Send the file and categories as a multipart upload to /splitter. The endpoint expects the document under the form field name file and the categories as a JSON-encoded string under categories.

Python
JavaScript
cURL

Continue the script by uploading the bundle:

split_bundle.py

with open("bundle.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/splitter",
        headers={"api-key": API_KEY},
        files={"file": ("bundle.pdf", f, "application/pdf")},
        data={"categories": json.dumps(categories)},
    )
response.raise_for_status()

The raise_for_status() call raises an HTTPError on any 4xx or 5xx response, so we don’t need to check .status_code ourselves.

Continue the script by uploading the bundle:

script.mjs

const form = new FormData();
form.append("file", new Blob([fs.readFileSync("bundle.pdf")]), "bundle.pdf");
form.append("categories", JSON.stringify(categories));

const response = await fetch(`${BASE_URL}/splitter`, {
  method: "POST",
  headers: { "api-key": API_KEY },
  body: form,
});
if (!response.ok) throw new Error(`${response.status}: ${await response.text()}`);

fetch resolves normally even when the server returns a 4xx or 5xx status, so we check response.ok and raise the error ourselves.

Run:

curl -X POST "https://prod.visionapi.unsiloed.ai/splitter" \
  -H "api-key: $UNSILOED_API_KEY" \
  -F "file=@bundle.pdf" \
  -F 'categories=[{"name":"Invoice","description":"Vendor invoices with itemized charges and a total due"},{"name":"Receipt","description":"Point-of-sale receipts with line items, tax, and payment method"},{"name":"Purchase Order","description":"Buyer-issued purchase orders authorizing goods or services from a vendor"}]'

The response prints to stdout. We need the job_id field for the next step.

2.4 Capture the Job ID

Python
JavaScript
cURL

Read and print the job_id:

split_bundle.py

job_id = response.json()["job_id"]
print(f"Job submitted: {job_id}")

Run the script:

python split_bundle.py

The output should be a single line like Job submitted: 887f26e6-d089-47f6-8def-afe84de40ecd.

Read and log the job_id:

script.mjs

const { job_id } = await response.json();
console.log(`Job submitted: ${job_id}`);

Run the script:

node script.mjs

The output should be a single line like Job submitted: 887f26e6-d089-47f6-8def-afe84de40ecd.

The response body from the POST above looks like:

{
  "job_id": "887f26e6-d089-47f6-8def-afe84de40ecd",
  "status": "processing",
  "quota_remaining": 7698
}

Copy the job_id value; you’ll paste it into the polling command in the next step.

Step 3: Poll and Download the Split Files

The job runs asynchronously. We GET /splitter/{job_id} repeatedly until the status is completed, then download each split PDF using the signed URL in the response. The status values the polling loop handles:

completed: the split files are ready to download
failed: the job errored; check the error field for details
queued: the job is waiting to be picked up
processing: the job is still running
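The four status values above map onto three things a polling loop can do. A small sketch of that mapping follows; next_action is a hypothetical name, and treating unrecognized status values as still running is an assumption, not documented behavior.

```python
IN_PROGRESS_STATUSES = {"queued", "processing"}


def next_action(status):
    """Map a job status to what a polling loop should do next."""
    if status == "completed":
        return "download"
    if status == "failed":
        return "raise"
    if status in IN_PROGRESS_STATUSES:
        return "wait"
    # Assumption: an unknown status is treated as still running, so the
    # attempt cap in the loop decides when to give up.
    return "wait"
```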

3.1 Write the Polling Loop

Python
JavaScript
cURL

Add a polling loop. The max_attempts cap stops the loop if the job hangs:

split_bundle.py

max_attempts = 60  # roughly 5 minutes at 5 seconds per poll
attempts = 0
while True:
    poll = requests.get(
        f"{BASE_URL}/splitter/{job_id}",
        headers={"api-key": API_KEY},
    )
    poll.raise_for_status()  # surface 401/404 instead of a KeyError on "status"
    result = poll.json()
    print(f"Status: {result['status']}")
    if result["status"] == "completed":
        break
    if result["status"] == "failed":
        raise RuntimeError(result.get("error", "split job failed"))
    attempts += 1
    if attempts >= max_attempts:
        raise TimeoutError("Split job did not finish within 5 minutes")
    time.sleep(5)

Add a polling loop. The maxAttempts cap stops the loop if the job hangs:

script.mjs

const maxAttempts = 60; // roughly 5 minutes at 5 seconds per poll
let attempts = 0;
let result;
while (true) {
  const res = await fetch(`${BASE_URL}/splitter/${job_id}`, {
    headers: { "api-key": API_KEY },
  });
  if (!res.ok) throw new Error(`${res.status}: ${await res.text()}`);
  result = await res.json();
  console.log(`Status: ${result.status}`);
  if (result.status === "completed") break;
  if (result.status === "failed") throw new Error(result.error || "split job failed");
  if (++attempts >= maxAttempts) throw new Error("Split job did not finish within 5 minutes");
  await new Promise((r) => setTimeout(r, 5000));
}

Replace JOB_ID below with the value you captured from Step 2.4, then run this loop. It polls every 5 seconds and gives up after 5 minutes if the job hasn’t completed:

JOB_ID="paste-job-id-here"
attempts=0
max_attempts=60  # roughly 5 minutes at 5 seconds per poll

while true; do
  resp=$(curl -sX GET "https://prod.visionapi.unsiloed.ai/splitter/$JOB_ID" \
    -H "api-key: $UNSILOED_API_KEY")
  status=$(echo "$resp" | grep -o '"status":"[^"]*"' | head -1 | cut -d'"' -f4)
  echo "Status: $status"
  [ "$status" = "completed" ] && break
  [ "$status" = "failed" ] && { echo "Job failed"; exit 1; }
  attempts=$((attempts + 1))
  [ "$attempts" -ge "$max_attempts" ] && { echo "Split job did not finish within 5 minutes"; exit 1; }
  sleep 5
done

The loop keeps the latest response body in $resp for the next step.
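The polling logic can also be separated from the HTTP call, which makes it testable without a network connection and leaves room for a growing delay between polls. This is a sketch under assumptions: poll_until_done and its parameters are hypothetical names, and fetch_status stands for any zero-argument callable that returns the parsed JSON body of GET /splitter/{job_id}.

```python
import time


def poll_until_done(fetch_status, max_attempts=60, delay=5.0, max_delay=30.0,
                    backoff=1.0, sleep=time.sleep):
    """Call fetch_status() until the job completes, fails, or attempts run out.

    With backoff=1.0 the delay stays fixed; a value above 1.0 grows the delay
    after every poll, capped at max_delay.
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch_status()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":
            raise RuntimeError(result.get("error") or "split job failed")
        if attempt < max_attempts:
            sleep(delay)
            delay = min(delay * backoff, max_delay)
    raise TimeoutError(f"job still running after {max_attempts} polls")
```

In the walkthrough script, fetch_status could be a lambda wrapping the requests.get(...).json() call from the loop above. The defaults reproduce the fixed 5-second interval and 60-attempt cap.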

3.2 Download the Split PDFs

Each entry in result.result.files has a presigned full_path URL that downloads the split PDF. The code below saves the metadata to result.json and writes each split file into a split_files/ directory.

Python
JavaScript
cURL

Add the download step:

split_bundle.py

with open("result.json", "w") as f:
    json.dump(result, f, indent=2)

os.makedirs("split_files", exist_ok=True)
for file_info in result["result"]["files"]:
    download = requests.get(file_info["full_path"])
    download.raise_for_status()  # don't save an error page as a PDF
    pdf_bytes = download.content
    out_path = os.path.join("split_files", file_info["name"])
    with open(out_path, "wb") as out:
        out.write(pdf_bytes)
    print(f"Saved {out_path} (confidence={file_info['confidence_score']:.2%})")

Run the script:

python split_bundle.py

You should see a few Status: processing lines, then Status: completed, then one Saved line per matched category.

Add the download step:

script.mjs

fs.writeFileSync("result.json", JSON.stringify(result, null, 2));

fs.mkdirSync("split_files", { recursive: true });
for (const file of result.result.files) {
  const res = await fetch(file.full_path);
  if (!res.ok) throw new Error(`Download failed for ${file.name}: ${res.status}`);
  const buf = Buffer.from(await res.arrayBuffer());
  const outPath = path.join("split_files", file.name);
  fs.writeFileSync(outPath, buf);
  console.log(`Saved ${outPath} (confidence=${(file.confidence_score * 100).toFixed(2)}%)`);
}

Run the script:

node script.mjs

You should see a few Status: processing lines, then Status: completed, then one Saved line per matched category.

The polling loop in Step 3.1 left the full response in $resp. Save the metadata, then download each split file using the presigned full_path URLs:

echo "$resp" > result.json
mkdir -p split_files

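# Print the URL first: file names such as "Purchase Order.pdf" contain spaces,
# and read assigns the remainder of the line to the last variable.
echo "$resp" \
  | python3 -c "import json,sys; [print(f['full_path'], f['name']) for f in json.load(sys.stdin)['result']['files']]" \
  | while read -r url name; do
      curl -s "$url" -o "split_files/$name"
      echo "Saved split_files/$name"
    done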

The split_files/ directory now holds one PDF per matched category, each named after its category.
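Since the output file names come from the response and are derived from category names, a cautious script may want to sanitize them before writing to disk. The sketch below shows one way to do that; safe_output_path is a hypothetical helper, and the set of replaced characters is an assumption chosen to cover common filesystems.

```python
import os
import re


def safe_output_path(directory, name):
    """Build a path inside `directory` from a server-supplied file name."""
    # Keep only the final path component so "../x" cannot escape the directory.
    base = os.path.basename(name.replace("\\", "/"))
    # Replace characters that are awkward on common filesystems; spaces are
    # kept, so "Purchase Order.pdf" is unchanged.
    base = re.sub(r'[<>:"|?*\x00-\x1f]', "_", base).strip(" .")
    if not base:
        base = "unnamed.pdf"
    return os.path.join(directory, base)
```

In the download loop, this would replace the direct os.path.join("split_files", file_info["name"]) call.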

Error Responses

Failures fall into two buckets: HTTP errors raised before the job is queued, and a failed status on a job that started but couldn’t complete.

HTTP Errors

The /splitter endpoint returns JSON error bodies with a detail field. The common cases:

401 Unauthorized: body is {"detail":"Invalid API key"}. The api-key header is missing or wrong.
400 Bad Request: body is {"detail":"Invalid JSON format for categories: ..."}. The categories form field isn’t valid JSON.
422 Unprocessable Entity: body is {"detail":[{"type":"missing","loc":["body","categories"],"msg":"Field required","input":null}]}. A required form field (usually file or categories) is missing.
400 Bad Request: body is {"detail":"At least one category is required"}. The categories array is empty.
400 Bad Request: body is {"detail":"Failed to process file: Failed to get PDF page count: ..."}. The upload isn’t a readable PDF.
404 Not Found: body is {"detail":"Job not found"}. The job_id you polled doesn’t exist.
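The detail field is a string in most of these cases but a list of validation entries for 422 responses, so code that prints errors has to handle both shapes. A minimal sketch follows, assuming the bodies look like the examples above; describe_error is a hypothetical helper name.

```python
def describe_error(status_code, body):
    """Turn a parsed JSON error body into a one-line message."""
    detail = body.get("detail") if isinstance(body, dict) else body
    if isinstance(detail, list):
        # 422 responses carry a list of entries with "loc" and "msg" keys.
        parts = []
        for item in detail:
            if isinstance(item, dict):
                field = ".".join(str(p) for p in item.get("loc", []))
                parts.append(f"{field}: {item.get('msg', 'invalid')}")
            else:
                parts.append(str(item))
        detail = "; ".join(parts)
    return f"HTTP {status_code}: {detail}"
```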

Failed Jobs

A job that was accepted but couldn’t be processed comes back with status: "failed" and a populated error field. The walkthrough’s polling loop raises on this case so you see the message instead of waiting out the timeout:

{
  "job_id": "7b31a7d7-e810-4a0b-931e-fbed0879bab2",
  "status": "failed",
  "progress": "Splitting failed",
  "error": "Failed to classify pages",
  "file_url": "https://example-bucket.s3.amazonaws.com/user_uploads/...",
  "file_name": "bundle.pdf",
  "parameters": { "classes": ["Invoice", "Receipt", "Purchase Order"], "page_count": 3 },
  "result": null
}

Response Shape

A completed response contains job metadata, an echo of the input parameters, and a result.files[] array with one entry per matched category. Each entry carries a presigned full_path URL we can download directly.

{
  "job_id": "887f26e6-d089-47f6-8def-afe84de40ecd",
  "status": "completed",
  "progress": "Starting document splitting...",
  "error": null,
  "file_url": "https://example-bucket.s3.amazonaws.com/user_uploads/...",
  "file_name": "bundle.pdf",
  "parameters": {
    "classes": ["Invoice", "Receipt", "Purchase Order"],
    "page_count": 3,
    "enable_reordering": false,
    "category_descriptions": {
      "Invoice": "Vendor invoices with itemized charges and a total due",
      "Receipt": "Point-of-sale receipts with line items, tax, and payment method",
      "Purchase Order": "Buyer-issued purchase orders authorizing goods or services from a vendor"
    }
  },
  "result": {
    "success": true,
    "message": "Successfully split PDF into 3 files",
    "files": [
      {
        "name": "Invoice.pdf",
        "path": "Invoice.pdf",
        "type": "file",
        "fileId": "359afb3c-2554-4acd-9cb3-be4044d7ec97",
        "full_path": "https://example-bucket.s3.amazonaws.com/files/...",
        "confidence_score": 0.9999976308610644
      },
      {
        "name": "Receipt.pdf",
        "path": "Receipt.pdf",
        "type": "file",
        "fileId": "eca3f2db-e113-4b5f-92db-aab89d417114",
        "full_path": "https://example-bucket.s3.amazonaws.com/files/...",
        "confidence_score": 0.9999976308610644
      },
      {
        "name": "Purchase Order.pdf",
        "path": "Purchase Order.pdf",
        "type": "file",
        "fileId": "1af2c343-9a86-46c5-b8f0-c48cc04d1b6f",
        "full_path": "https://example-bucket.s3.amazonaws.com/files/...",
        "confidence_score": 0.9999976308610644
      }
    ]
  }
}

The fields fall into three broad categories.

For downloading the split PDFs:

result.files[].full_path: presigned S3 URL to download the split PDF; this is what the walkthrough fetches into split_files/
result.files[].name: filename derived from the matched category, suitable for saving to disk
result.files[].confidence_score: the splitter’s confidence in the classification, on a 0-1 scale; use it to flag low-confidence splits for human review
result.files[].fileId: unique identifier for the split file, useful for tracking or deduplicating downstream
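As an illustration of using confidence_score to route splits to human review, the sketch below collects the names of files that fall under a cutoff. files_needing_review is a hypothetical helper, and the 0.9 default threshold is an arbitrary assumption to tune against your own data.

```python
def files_needing_review(response, threshold=0.9):
    """Return names of split files whose confidence_score is below threshold."""
    # "result" is null on failed jobs, so fall back to an empty file list.
    files = (response.get("result") or {}).get("files") or []
    return [
        f["name"]
        for f in files
        if f.get("confidence_score", 0.0) < threshold
    ]
```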

For echoing the request back:

parameters.classes: the category names you submitted
parameters.category_descriptions: the descriptions you submitted, keyed by category name
parameters.page_count: number of pages in the uploaded PDF
parameters.enable_reordering: whether the splitter reordered pages within each category after classification; defaults to false
file_url: signed URL to the original uploaded bundle
file_name: name of the uploaded bundle
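The echoed parameters.classes list can be compared against result.files to see which submitted categories produced no file. This sketch assumes each file name is the category name plus a .pdf extension, as in the sample response; that naming rule is inferred from the example, not a documented guarantee, and unmatched_categories is a hypothetical helper.

```python
import os


def unmatched_categories(response):
    """Return submitted category names that have no corresponding split file."""
    submitted = (response.get("parameters") or {}).get("classes") or []
    files = (response.get("result") or {}).get("files") or []
    # Assumption: "Purchase Order.pdf" corresponds to category "Purchase Order".
    matched = {os.path.splitext(f["name"])[0] for f in files}
    return [name for name in submitted if name not in matched]
```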

For job and progress tracking:

status: completed, failed, or an in-progress value such as processing
progress: human-readable progress message
error: error message if the job failed, otherwise null
result.success: whether the split operation succeeded
result.message: human-readable success or failure message

Sample Output

Running the script against the sample AP batch produces a split_files/ directory with one PDF per matched category:

split_files/
├── Invoice.pdf          # page 1 — the Greenfield Print & Bindery invoice
├── Receipt.pdf          # page 2 — the Cooper's Office Supply receipt
└── Purchase Order.pdf   # page 3 — the Lighthouse Studios LLC purchase order

All three files report a confidence score above 99%. Open them to confirm each page of the bundle landed in the matching category file.
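Before opening the files by hand, a quick automated check can confirm that each download is actually a PDF and not, for example, an error body saved from an expired URL. The sketch relies on the fact that PDF files begin with the bytes %PDF-; non_pdf_files is a hypothetical helper name.

```python
import os


def non_pdf_files(directory="split_files"):
    """Return names of files in `directory` that do not start with the PDF header."""
    bad = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            if f.read(5) != b"%PDF-":
                bad.append(name)
    return bad
```

An empty list means every file in the directory at least carries a PDF header.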

Next Steps

For more on splitting, including the underlying classification step and the full response shape, see the Splitting overview and the Response Format reference.

Splitting Overview

Learn how the splitter groups pages and where to use it in a pipeline.

Classification

Classify a single document against candidate categories instead of splitting a bundle.

API Reference

Browse the full request and response specs for the splitting endpoint.

FAQ

Check limits, supported formats, and answers to common questions.

