Parse Document
Parse and segment PDFs, images, and Office files into meaningful sections using advanced AI models with flexible customization options.
Overview
The Parse Document endpoint processes PDFs, images (PNG, JPEG, TIFF), and Office files (PPT, DOCX, XLSX) and breaks them into meaningful sections with detailed analysis, including text extraction, image recognition, table parsing, and OCR data. You can provide documents either by direct file upload or by presigned URL. This endpoint supports advanced customization options for fine-tuning the parsing behavior to match your specific use cases.
The parse endpoint uploads your document and configuration in a single request:
- POST to /parse with your file and configuration: the API uploads the document and creates a parse job.
- The job is automatically enqueued for processing.
- Poll GET /parse/{job_id} to track progress and retrieve results.
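The submit-then-poll flow above can be sketched as a small polling helper. This is a minimal sketch, not the official client: `fetch_job` is a hypothetical callable standing in for a `GET /parse/{job_id}` request, and the `status` key and the `"Processing"`, `"Succeeded"`, and `"Failed"` values are assumptions (the source only confirms that a new job starts as `"Starting"`).

```python
import time
from typing import Callable, Dict, Iterable


def poll_parse_job(
    fetch_job: Callable[[str], Dict],
    job_id: str,
    terminal_statuses: Iterable[str] = ("Succeeded", "Failed"),  # assumed names
    interval_s: float = 2.0,
    max_attempts: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_job(job_id) until the job reports a terminal status."""
    terminal = set(terminal_statuses)
    for _ in range(max_attempts):
        job = fetch_job(job_id)  # stands in for GET /parse/{job_id}
        if job.get("status") in terminal:
            return job
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")


# Usage with a fake fetcher so the sketch runs without network access.
fake_responses = iter(
    [{"status": "Starting"}, {"status": "Processing"}, {"status": "Succeeded"}]
)
result = poll_parse_job(
    lambda _job_id: next(fake_responses), "job-123", sleep=lambda _s: None
)
```

Injecting `fetch_job` and `sleep` keeps the loop independent of any HTTP library and easy to test; in real use `fetch_job` would wrap an authenticated request.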
Request
Provide either file or url. If both are provided, file takes precedence.
Document file to process. Required if url is not provided.
Presigned or public URL of the document. Required if file is not provided.
Defaults to false.
layout_analysis values:
- "smart_layout_detection" (default): Intelligently identifies document structure, headers, sections, and content relationships across the entire document using bounding boxes.
- "page_by_page": Analyzes each page independently as a single segment. Faster for simple documents.
- "advanced_layout_detection": Uses a vision-language model for exhaustive page segmentation. Detects 14 element types (Caption, Footnote, Formula, ListItem, PageFooter, PageHeader, Picture, SectionHeader, Table, Text, Title, KeyValuePair, Signature, Seal). Best for visually complex or unusual layouts.
ocr_strategy values:
- "auto_detection" (default): Intelligently detects bad quality PDFs, scanned documents, and images, then applies OCR only where needed.
- "force_ocr": Runs OCR on the entire document regardless of quality.
OCR engine values:
- "UnsiloedHawk" (default): Higher accuracy for complex layouts and mixed content. Unrecognized values also fall back to this engine.
- "UnsiloedBeta": Handles rotated/warped text and irregular bounding boxes.
- "UnsiloedStorm": Enterprise-grade accuracy optimized for 50+ languages.
agentic_ocr values:
- "standard": Good balance of speed and accuracy.
- "advanced": Higher quality, best for complex layouts, rotated text, and mixed-language content.
Defaults to false.
Defaults to false.
Maximum number of tables per merge group when merge_tables is enabled. Groups larger than this are split. Defaults to 20.
Defaults to false.
If PII is found at or above pii_block_severity, the task is rejected and no parsing occurs. Defaults to false.
Severity threshold applied when detect_pii is enabled: any (default) blocks on any PII found; low blocks on quasi-identifiers (names, dates, locations) or higher; medium blocks on contact PII (email, phone) or higher; high blocks only on direct identifiers (SSN, passport, credit card). Ignored if detect_pii is false.
PII detector engine: standard (default) or advanced (higher precision, additional processing cost). Ignored if detect_pii is false.
JSON array string of segment types to validate with VLM. Example: ["Table", "Formula", "Picture"]. Defaults to ["Table", "Picture"]; an empty or unparseable value also falls back to that default, so Table and Picture validation runs even when this field is omitted.
Legacy option. Prefer validate_segments: ["Table"] instead. Defaults to false.
Use "all" to include everything. Defaults to "all".
Available segment types:
- table: Tabular data segments
- picture: Image and graphic segments
- formula: Mathematical equations
- text: Regular text content
- sectionheader: Section headers
- title: Document titles
- listitem: List items
- caption: Image captions
- footnote: Footnotes
- pageheader: Page headers
- pagefooter: Page footers
- keyvaluepair: Key-value pairs (advanced layout detection)
- signature: Signatures (advanced layout detection)
- seal: Seals and stamps (advanced layout detection)
- page: Full-page segments
Examples: "table", "table,picture", "table,formula", "picture,formula".
Defaults to false.
Set fields to false to exclude them and reduce response size. All fields default to true. Ignored when response_profile is slim or full (the profile wins).
Available fields:
- html: HTML representation of segments
- markdown: Markdown representation of segments
- ocr: Raw OCR text data with bounding boxes and confidence scores
- image: Cropped segment images (base64 encoded)
- content: Text content of segments
- bbox: Bounding box coordinates
- confidence: Confidence scores for segments
- embed: Vector embeddings / embed text
- chart_data: Extracted chart data for Picture segments identified as charts
Example: {"html": true, "markdown": true, "ocr": false, "image": false}.
Response shape selector: slim, full, or custom. Omit to return the full shape.
- "slim": Returns only the essentials per chunk: embed, bbox, page_number, segment_id, segment_type, and HTML for tables / Markdown for everything else. Drops content, image, ocr, confidence, chart_data, page_height, page_width. Best for embedding-only workflows where you want the smallest payload.
- "full": Every field returned (equivalent to omitting this param).
- "custom": Honor output_fields verbatim.
When both response_profile and output_fields are provided, the profile wins; output_fields is only consulted for custom or when the profile is omitted. Applies to inline JSON responses only; GET /parse/{job_id}?output_file=true returns a presigned URL to the stored full-shape output file.
- html: "VLM" or "Auto"
- markdown: "VLM" or "Auto"
- model_id (Table): "astra", "us_table_v1", "us_table_v2"
- model_id (Picture/Formula): "nova", "luna", "sol"
- use_table_ocr (Table only): Advanced OCR optimized for tabular data. Better handles bordered cells, gridlines, and complex table layouts.
- vlm: Custom prompt for the VLM model. Use this to give the model specific instructions for extracting or describing these segment types.
- translation: Optional per-segment translation, e.g. {"provider": "Auto", "target_language": "en"}. provider is "Auto" for fast machine translation or "VLM"/"LLM" for model-based translation; target_language is an ISO 639-1 code, or "auto" to auto-detect the source and translate to English. Optional model_id and prompt apply to model-based translation.
Alias for segment_analysis. If both are provided, segment_processing takes precedence.
Page range to process. Formats: "1-5", "2,4,6", "[1,3,5]". Defaults to all pages.
Segment type naming convention. "Unsiloed" (default) uses names like PageHeader, ListItem, Picture. "Other" uses alternative names like Header, List Item, Figure.
Defaults to false.
Defaults to false.
Export format(s) to generate after processing, e.g. ["docx"]. When set, the pipeline generates the requested export files after parsing completes. The exported files are available as presigned URLs in the exports field of the response. Supported values: "docx", "markdown", "json".
Pass a JSON array string (e.g., ["docx"]), not a repeated field. Passing a bare value like docx will fail to parse and silently skip the export.
Error handling strategy. "Continue" (default) proceeds despite errors (e.g., LLM refusals on individual segments). "Fail" stops and fails the task on any error.
Reserved field. Currently has no effect on retention for POST /parse: the task is not auto-deleted. To get a presigned-upload TTL, use POST /v2/parse/upload instead, where expires_in controls the upload URL’s validity.
Configuration Best Practices
High-Accuracy Processing
- Legal documents requiring precise text extraction
- Financial statements with complex tables
- Archival documents with low-quality scans
- Documents where accuracy is more important than speed
- Latency: +2-3 seconds per page for high resolution
- Latency: +1-2 seconds per page for segment validation
Fast Processing
- High-volume document processing
- Real-time applications requiring quick results
- Documents with simple layouts
- Pre-screened high-quality digital documents
- Fastest processing time
- Lower cost per document
- Suitable for batch processing large volumes
Financial Documents (Tables + Charts)
- Balance sheets and P&L statements
- Quarterly/annual financial reports
- Investment reports with charts
- Documents where only structured data matters
- Reduced response size (text content filtered out)
- Focus on data-rich content
- Merged multi-page tables for complete datasets
Data Extraction Only (Tables)
- Extracting data from invoices
- Processing structured forms
- Database population from documents
- CSV/Excel export workflows
- Minimal response payload
- Faster data transfer
- Easy integration with data pipelines
Academic/Research Documents
- Research papers with bibliographies
- Academic articles with citations
- Scientific documents
- Literature reviews
- Automatic citation extraction and linking
- Structured bibliography metadata
- In-text citation hyperlinks in markdown
- Preserves academic document structure
Scanned Documents
- Scanned paper documents
- Low-quality photocopies
- Historical documents
- Image-based PDFs
- Maximum OCR coverage
- Better text extraction from poor quality sources
- Higher accuracy for challenging documents
Output Fields Optimization
Omit output_fields or set all fields to true to include all available data.
Benefits:
- Reduced response size and bandwidth usage
- Faster processing and data transfer
- Cost optimization for high-volume processing
Parameter Details
File Input Options
The API supports two methods for providing the document to process:
- Direct File Upload (file parameter): Upload the document file directly as multipart/form-data
- Presigned URL (url parameter): Provide a publicly accessible URL or presigned URL to the document

- You must provide either file or url, but not both
- When using url, the document will be downloaded from the provided URL before processing
- Presigned URLs are ideal for documents already stored in cloud storage (S3, GCS, Azure Blob, etc.)
- The URL must be publicly accessible or include necessary authentication parameters (e.g., S3 presigned URLs with signatures)
- Supported formats are the same for both methods: PDF, images (PNG, JPEG, TIFF), and office documents (PPT, PPTX, DOC, DOCX, XLS, XLSX)
- Documents already stored in cloud storage
- Avoiding duplicate file uploads
- Integration with existing document management systems
- Processing large files without upload overhead
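As a rough illustration of the file-or-url rule, the helper below prepares the text fields of a multipart request. It is a sketch under stated assumptions: `build_parse_form` is a hypothetical name, encoding booleans as `"true"`/`"false"` is a guess at what the API accepts, and the binary `file` part itself would be attached separately by whatever HTTP client you use.

```python
import json
from typing import Any, Dict, Optional


def build_parse_form(
    file_path: Optional[str] = None,
    url: Optional[str] = None,
    **options: Any,
) -> Dict[str, str]:
    """Return the non-file form fields for a POST /parse multipart request."""
    # Enforce "either file or url, but not both".
    if (file_path is None) == (url is None):
        raise ValueError("provide exactly one of file_path or url")
    fields: Dict[str, str] = {}
    if url is not None:
        fields["url"] = url
    for name, value in options.items():
        if isinstance(value, bool):
            # Assumption: booleans travel as lowercase strings in form parts.
            fields[name] = "true" if value else "false"
        elif isinstance(value, (dict, list)):
            # JSON-valued options are sent as JSON strings.
            fields[name] = json.dumps(value)
        else:
            fields[name] = str(value)
    return fields


form = build_parse_form(
    url="https://example.com/report.pdf",  # placeholder URL
    merge_tables=True,
    segment_filter="table,picture",
    output_fields={"image": False},
)
```

Serializing objects and arrays with `json.dumps` matches the requirement that JSON-valued options be passed as JSON strings in multipart requests.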
Segmentation Method
The layout_analysis parameter controls how the document is analyzed and segmented:
- "smart_layout_detection" (default): Analyzes pages for layout elements (e.g., Table, Picture, Formula) using bounding boxes. Provides fine-grained segmentation and better chunking for complex documents.
- "page_by_page": Treats each page as a single segment. Faster processing, ideal for simple documents without complex layouts.
- "advanced_layout_detection": Uses a vision-language model to exhaustively segment each page into 14 element types (including KeyValuePair, Signature, and Seal in addition to the standard set). Recommended for documents with dense, non-standard, or visually complex layouts where VGT-based detection misses regions.
Agentic OCR
The agentic_ocr parameter enables per-segment OCR enhancement after layout detection, yielding higher accuracy on small text, stylized fonts, and mathematical formulas.
Values:
- "standard": Fast, good for most documents.
- "advanced": Higher quality, better for complex layouts, rotated or irregular text, and multilingual content.
OCR Mode
The ocr_strategy parameter controls optical character recognition processing:
- "auto_detection" (default): Intelligently determines when OCR is needed based on the document content. Balances accuracy and performance.
- "force_ocr": Applies OCR to all content regardless of existing text layer. Use this for scanned documents or when maximum text extraction is required.
Table Merging
The merge_tables parameter enables merging of tables that span multiple pages.
How It Works:
- Analyzes consecutive table segments across pages
- Identifies tables with matching column headers
- Merges them into a single unified table structure
- Preserves table formatting and data integrity
- Multi-Page Financial Statements: Consolidate P&L statements or balance sheets spanning multiple pages
- Large Data Tables: Merge inventory lists, transaction records, or data sets split across pages
- Reports with Continuation Tables: Automatically combine tables marked with “continued on next page”
- Simplified Data Processing: Work with complete tables instead of fragments
- Better Context: Maintain full table context for analysis and extraction
- Reduced Post-Processing: Eliminates need for manual table stitching
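To make the merging steps concrete, here is a simplified client-side sketch of the same idea. It is not the service's implementation: the table shape (`page`, `header`, `rows`) is a hypothetical stand-in for real segments, and "matching column headers" is reduced to exact equality on consecutive pages. The `max_group_size` default of 20 mirrors the documented cap on tables per merge group.

```python
from typing import Dict, List


def merge_consecutive_tables(
    tables: List[Dict], max_group_size: int = 20
) -> List[Dict]:
    """Merge tables on consecutive pages that share identical column headers."""
    merged: List[Dict] = []
    group_sizes: List[int] = []  # how many source tables went into each result
    for table in tables:
        last = merged[-1] if merged else None
        can_merge = (
            last is not None
            and last["header"] == table["header"]
            and table["page"] == last["last_page"] + 1
            and group_sizes[-1] < max_group_size
        )
        if can_merge:
            last["rows"].extend(list(row) for row in table["rows"])
            last["last_page"] = table["page"]
            group_sizes[-1] += 1
        else:
            # Start a new group; copy rows so the input is left untouched.
            merged.append(
                {
                    "header": list(table["header"]),
                    "rows": [list(row) for row in table["rows"]],
                    "first_page": table["page"],
                    "last_page": table["page"],
                }
            )
            group_sizes.append(1)
    return merged


pages = [
    {"page": 1, "header": ["Item", "Qty"], "rows": [["A", "1"]]},
    {"page": 2, "header": ["Item", "Qty"], "rows": [["B", "2"]]},
    {"page": 3, "header": ["Name"], "rows": [["x"]]},
]
merged_tables = merge_consecutive_tables(pages)
```

In this example the first two tables collapse into one spanning pages 1 to 2, while the third stays separate because its header differs.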
Citation Extraction (Research Papers)
The xml_citation parameter enables automatic extraction and linking of citations from research papers, academic articles, and scientific documents.
How It Works:
- Extracts structured bibliography from the document
- Identifies in-text citation references (e.g., “Chen et al., 2021”)
- Hyperlinks citations in the markdown output to their bibliography entries
- Returns structured citation metadata in the response
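The hyperlinking step can be pictured with a small text-rewriting sketch. This only illustrates the output shape and is not how the service detects citations: `link_citations` is a hypothetical helper, and the mapping from citation text to anchor is assumed to be already known.

```python
import re
from typing import Dict


def link_citations(markdown: str, anchors: Dict[str, str]) -> str:
    """Wrap each known in-text citation in a Markdown link to its anchor."""
    if not anchors:
        return markdown
    # Longest keys first so overlapping citation strings prefer the longer one.
    keys = sorted(anchors, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(
        lambda match: f"[{match.group(0)}]({anchors[match.group(0)]})", markdown
    )


linked = link_citations(
    "As shown by Chen et al. (2021)...",
    {"Chen et al. (2021)": "#ref-5"},
)
# linked == "As shown by [Chen et al. (2021)](#ref-5)..."
```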
The response includes a metadata field with structured citation data.
- Original: "As shown by Chen et al. (2021)..."
- Enhanced: "As shown by [Chen et al. (2021)](#ref-5)..."
Content Type Filtering
The segment_filter parameter allows you to filter the output to include only specific segment types, reducing response size and focusing on relevant content.
How It Works:
- Accepts a comma-separated list of segment types (case-insensitive)
- Filters segments after processing is complete
- Removes chunks that have no segments after filtering

- "all" (default): Include all segment types
- "table": Only table segments
- "picture": Only image/graphic segments
- "table,picture": Tables and pictures only
- "table,formula": Tables and formulas only
- Custom combinations using any segment type
table,picture,formula,text,sectionheader,title,listitem,caption,footnote,pageheader,pagefooter
- Tables Only: Extract only tabular data from financial documents
- Pictures Only: Extract charts, graphs, and diagrams for visual analysis
- Tables + Pictures: Get structured data and visualizations, skip text content
- Custom Combinations: Mix any segment types based on your needs
- Reduced Response Size: Filter out unwanted content before receiving results
- Faster Processing: Less data to transfer and parse
- Focused Extraction: Get only the content types you need
- Cost Optimization: Smaller responses reduce bandwidth usage
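The filtering behavior described in this section (case-insensitive matching, filtering after processing, dropping chunks left empty) can be mirrored on the client for already-fetched results. This is a sketch under assumptions: each chunk is taken to be a dict with a `segments` list, and each segment carries a `segment_type` string; treating an empty filter as `"all"` is also an assumption.

```python
from typing import Dict, List


def filter_chunks(chunks: List[Dict], segment_filter: str = "all") -> List[Dict]:
    """Keep only segments whose type is listed in a comma-separated filter."""
    wanted = {part.strip().lower() for part in segment_filter.split(",") if part.strip()}
    if not wanted or "all" in wanted:
        return chunks
    kept: List[Dict] = []
    for chunk in chunks:
        segments = [
            segment
            for segment in chunk.get("segments", [])
            if segment.get("segment_type", "").lower() in wanted
        ]
        if segments:  # chunks with no remaining segments are removed
            kept.append({**chunk, "segments": segments})
    return kept


sample_chunks = [
    {"segments": [{"segment_type": "Table"}, {"segment_type": "Text"}]},
    {"segments": [{"segment_type": "Text"}]},
]
tables_and_pictures = filter_chunks(sample_chunks, "TABLE, picture")
```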
Output Fields Configuration
The output_fields parameter allows you to control which fields are included in the API response. This is useful for reducing response size, improving performance, and optimizing bandwidth usage when you don’t need all available data.
Available Fields:
- html (default: true): Include HTML representation of segments
- markdown (default: true): Include Markdown representation of segments
- ocr (default: true): Include OCR results with bounding boxes and confidence scores
- image (default: true): Include cropped segment images (base64 encoded)
- content (default: true): Include text content of segments
- bbox (default: true): Include bounding box coordinates
- confidence (default: true): Include confidence scores for segments
- embed (default: true): Include embed text in chunk responses
Set fields to false to exclude them from the response. Fields not specified default to true for backward compatibility.
Example Configuration:
- Reduced Response Size: Excluding large fields like image and html can significantly reduce payload size
- Faster Processing: Less data to serialize and transfer
- Cost Optimization: Smaller responses reduce bandwidth costs
- Selective Data: Only retrieve the fields you need for your use case

- Minimal Response: Set most fields to false when you only need basic content
- Text-Only Processing: Exclude image and ocr when processing text content
- Embedding Generation: Include only content and embed when generating embeddings
- Full Analysis: Keep all fields enabled (default) for comprehensive document analysis
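The interaction between output_fields and response_profile (the profile wins unless it is custom or omitted) can be summarized in a few lines. The function below is an illustrative model of the documented precedence, not server code; `resolve_output_fields` is a hypothetical name, and for the slim profile it marks both html and markdown as kept even though the API returns only one of them per segment (HTML for tables, Markdown otherwise).

```python
from typing import Dict, Optional

OUTPUT_FIELDS = (
    "html", "markdown", "ocr", "image", "content",
    "bbox", "confidence", "embed", "chart_data",
)
# Fields the slim profile is documented to drop (page_height and page_width
# are also dropped but are not output_fields keys).
SLIM_DROPPED = {"content", "image", "ocr", "confidence", "chart_data"}


def resolve_output_fields(
    response_profile: Optional[str] = None,
    output_fields: Optional[Dict[str, bool]] = None,
) -> Dict[str, bool]:
    """Model which fields end up in the response for a given configuration."""
    if response_profile == "full":
        return {name: True for name in OUTPUT_FIELDS}
    if response_profile == "slim":
        return {name: name not in SLIM_DROPPED for name in OUTPUT_FIELDS}
    if response_profile in (None, "custom"):
        overrides = output_fields or {}
        # Unspecified fields default to true.
        return {name: bool(overrides.get(name, True)) for name in OUTPUT_FIELDS}
    raise ValueError(f"unknown response_profile: {response_profile!r}")


text_only = resolve_output_fields(None, {"image": False, "ocr": False})
slim = resolve_output_fields("slim", {"image": True})  # output_fields is ignored
```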
Segment Analysis Configuration
The segment_analysis parameter allows you to customize how different segment types are processed, including HTML/Markdown generation strategies and which field should populate the content field.
Available Segment Types:
You can configure processing for any of the following segment types:
- Table: Tabular data segments
- Picture: Image and graphic segments
- Formula: Mathematical equations
- Title: Document titles
- SectionHeader: Section headers
- Text: Regular text content
- ListItem: List items
- Caption: Image captions
- Footnote: Footnotes
- PageHeader: Page headers
- PageFooter: Page footers
- Page: Full page segments

- html: Generation strategy for HTML representation
  - "Auto" (default): Automatically determine the best method
  - "VLM": Use VLM to generate HTML
- markdown: Generation strategy for Markdown representation
  - "Auto" (default): Automatically determine the best method
  - "VLM": Use VLM to generate Markdown
- content_source: Defines which field should populate the content field in the response
  - "OCR" (default): Use OCR text for content
  - "HTML": Use HTML representation as content
  - "Markdown": Use Markdown representation as content
  - "VLM" (alias "LLM"): Use the VLM-generated representation as content
- model_id (Table segments only): Specifies which AI model to use for table processing
  - "us_table_v1": Standard table processing model
  - "us_table_v2": Enhanced table processing model with improved accuracy
- vlm: Custom prompt for the VLM model. Use this to give the model specific instructions for extracting or describing these segment types.
- translation: Optional per-segment translation configuration:
  - provider: "Auto" (fast machine translation) or "VLM"/"LLM" (model-based translation)
  - target_language: ISO 639-1 code (e.g. "en", "es", "fr", "ko"), or "auto" to auto-detect the source language and translate to English
  - model_id (optional): model for VLM/LLM translation; defaults to the provider default
  - prompt (optional): custom instructions appended to the translation system prompt
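A segment_analysis value combining the options listed above might look like the following. This is an illustrative configuration only: the particular combination of settings and the prompt text are made up for the example, and sending the value as a JSON string applies to multipart requests.

```python
import json

# Per-segment-type settings; keys are segment type names from this section.
segment_analysis = {
    "Table": {
        "html": "VLM",
        "markdown": "VLM",
        "content_source": "HTML",   # content will hold the HTML representation
        "model_id": "us_table_v2",
    },
    "Picture": {
        "markdown": "VLM",
        "content_source": "Markdown",
        "vlm": "Describe the chart axes and the main trend.",  # example prompt
    },
    "Text": {
        "translation": {"provider": "Auto", "target_language": "en"},
    },
}

# Multipart callers send the object as a JSON string in a single form field.
segment_analysis_field = json.dumps(segment_analysis)
```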
How content_source Works:
The content_source parameter determines which field’s value will be used to populate the content field in the segment response:
- When content_source is set to "HTML", the content field will contain the HTML representation, and the separate html and markdown fields will be empty
- When content_source is set to "Markdown", the content field will contain the Markdown representation, and the separate html and markdown fields will be empty
- When content_source is set to "OCR" (default), the content field contains OCR text, and html and markdown fields are populated separately

- HTML as Content: Set content_source: "HTML" for Table segments when you want HTML-formatted table data directly in the content field
- Markdown as Content: Set content_source: "Markdown" for Picture segments when you want Markdown-formatted descriptions in the content field
- VLM-Enhanced Output: Use "VLM" for both html and markdown generation strategies to get AI-enhanced representations in those fields
Response
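Job identifier: pass this to GET /parse/{job_id} to poll for results.
Initial job status. Always "Starting" on creation.
"unknown" when no usable segment exists.
Whether table merging is enabled for this job (reflects the submitted merge_tables value).
Document Analysis Features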
The parsing endpoint provides comprehensive document analysis including:
Text Extraction
Extracts text content with high accuracy, preserving formatting and structure.
Image Recognition
Identifies and analyzes images within documents, providing descriptions and metadata.
Table Parsing
Extracts tabular data with proper structure and formatting.
OCR Processing
Performs optical character recognition on text elements with confidence scores.
Section Detection
Automatically identifies different document sections like headers, body text, and captions.
Bounding Box Information
Provides precise coordinates for all extracted elements.
Advanced Content Processing
- VLM-Enhanced Analysis: Uses vision-language models for better content understanding
- Multi-Format Output: Generates HTML, Markdown, and plain text versions
- Context-Aware Processing: Maintains document context across segments
- Intelligent Chunking: Creates semantically meaningful document chunks
Retrieving Results
After the job is created, use the GET /parse/{job_id} endpoint to check status and retrieve results.
Expected Results Structure
When the job completes successfully, the response contains comprehensive document analysis with enhanced processing.
Segment Types
The parsing API identifies and processes different types of document segments with enhanced processing:
Picture
Images and graphics within the document, including logos, charts, and illustrations. Enhanced with VLM-based description generation.
SectionHeader
Document headers and titles that define section boundaries. Processed with semantic understanding.
Text
Regular text content including paragraphs, sentences, and individual text elements. Enhanced with context-aware processing.
Table
Tabular data with structured rows and columns. Enhanced with VLM-based formatting and extended context options. You can configure the table processing model using model_id in the segment_analysis parameter:
- us_table_v1: Standard table processing model
- us_table_v2: Enhanced table processing model with improved accuracy
Caption
Text captions associated with images or figures. Processed with relationship awareness.
Formula
Mathematical equations and expressions. Enhanced with specialized formula processing.
Title
Document titles and main headings. Processed with enhanced formatting.
Footnote
Document footnotes and references. Processed with context linking.
ListItem
Bulleted and numbered list items. Processed with structure preservation.
Each segment includes detailed metadata such as confidence scores, bounding boxes, OCR data, and formatted output in both HTML and Markdown with VLM enhancement.
Error Handling
Common Error Scenarios
- Invalid API Key: Authentication failed
- File Too Large: File exceeds size limits
- Invalid Configuration: Malformed processing parameters
- Server Error: Internal processing error
- Processing Timeout: Task took too long to complete
- Missing File or URL: Neither file nor url parameter provided
- Both File and URL Provided: Cannot provide both file and url simultaneously
- Invalid URL: URL is not accessible or malformed
- URL Download Failed: Unable to download document from provided URL
- Insufficient Quota (402): Not enough page credits remaining.
- Usage Limit Exceeded (429): Billing usage cap reached. Returns plain text: Usage limit exceeded. No Retry-After header.
- Rate Limit Exceeded (429): Org exceeded its per-second request budget (default 10 requests per second, configurable per organization). Returns JSON {"error": "rate_limit_exceeded", "message": ..., "retry_after": 1} with a Retry-After: 1 header.
- Internal Server Error (500): An unexpected error occurred during processing.
- Service Unavailable (503): Job queue is at capacity. Retry after the duration indicated in the Retry-After header.
- Forbidden (403): Access has been revoked.
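One way a client might act on these status codes is to retry only when the server supplies a Retry-After header, which separates the rate-limit 429 and the 503 from the billing-cap 429. The helper below is a sketch of that policy, not a prescribed one: `retry_delay_seconds` is a hypothetical name, and it assumes Retry-After is given in seconds, as in the examples above.

```python
from typing import Mapping, Optional


def retry_delay_seconds(status: int, headers: Mapping[str, str]) -> Optional[float]:
    """Return how long to wait before retrying, or None if retrying is pointless."""
    # Header names are case-insensitive, so look the value up manually.
    retry_after = next(
        (value for name, value in headers.items() if name.lower() == "retry-after"),
        None,
    )
    if status in (429, 503) and retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            return None  # non-numeric value: this sketch does not parse dates
    # 429 without Retry-After is the billing usage cap; 402, 403 and other
    # errors are not fixed by waiting.
    return None


rate_limited = retry_delay_seconds(429, {"Retry-After": "1"})
usage_capped = retry_delay_seconds(429, {})
```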
Authorizations
API key for authentication. Use 'Bearer <your_api_key>'
Body
Provide either file (binary upload, multipart only) or url (presigned/public URL, both content types), not both. JSON callers send all fields as native JSON values; multipart callers send each field as a form part. The file field is multipart-only.
Request body for POST /parse (multipart/form-data).
Provide either file (binary upload) or url (presigned/public URL) — not both.
Document file to process. Required if url is not provided.
Supported formats: PDF, PNG, JPEG, TIFF, PPT, PPTX, DOC, DOCX, XLS, XLSX.
Enable per-segment agentic OCR for higher accuracy. Pass "standard" or "advanced".
JSON object for chunk processing configuration.
Run a PII pre-check before parsing. When enabled, the document is scanned
for personally identifiable information before any extraction work happens.
If PII is found at or above pii_block_severity, the task is rejected and
no parsing occurs (the job ends in a failed state with a PII reason).
Defaults to false.
Fix the reading order of detected segments. Defaults to false.
Error handling strategy for non-critical processing errors.
Continue (default) — proceed despite errors (e.g., LLM refusals).
Fail — stop and fail the task on any error.
Reserved field. Persisted in the task configuration but currently has no
effect on retention for this endpoint — POST /parse (multipart and JSON/Form)
does not set the task's expires_at column, and the cleanup job only deletes
AwaitingUpload rows past their expires_at. To get a presigned-upload TTL,
use POST /v2/parse/upload instead, where expires_in controls the upload
URL's validity.
Export format(s) to generate after processing.
When set, the pipeline generates the requested export files after parsing completes.
The exported files are available as presigned URLs in the exports field of the response.
Supported: ["docx", "markdown", "json"].
File format for exporting parsed results. When specified in a parse request,
the pipeline generates the requested export file after processing completes.
The exported file is available via the exports field in the task response.
Available options: docx, markdown, json. Example: ["docx", "markdown", "json"]
Transfer text color from the PDF text layer to OCR results. Defaults to false.
Attach hyperlink URLs from PDF annotations to OCR results. Defaults to false.
Preserve strikethrough formatting in HTML/Markdown output. Defaults to false.
Layout analysis strategy.
smart_layout_detection (default) — detects layout elements using bounding boxes.
page_by_page — treats each page as a single segment; faster for simple documents.
advanced_layout_detection — higher-accuracy layout detection for complex pages
(multi-column layouts, dense tables/figures); slower than smart_layout_detection.
JSON object for LLM processing configuration.
Maximum number of tables per merge group when merge_tables is enabled.
Groups larger than this are split into separate merges. Defaults to 20.
Merge tables that span multiple pages into a single unified structure. Defaults to false.
OCR engine to use for text recognition.
UnsiloedBeta (default) — handles irregular bounding boxes, rotated/warped text.
UnsiloedHawk — higher accuracy, better for complex layouts.
UnsiloedStorm — enterprise-grade accuracy, optimized for 50+ languages.
OCR strategy.
auto_detection (default) — applies OCR only where needed.
force_ocr — applies OCR to all content regardless of existing text layer.
JSON object filtering which fields appear on each segment / chunk.
Each key defaults to true; set a key to false to drop the field.
Keys: bbox, chart_data, confidence, content, embed, html,
image, markdown, ocr. Example: {"html": false, "ocr": false}.
Ignored when response_profile is slim or full.
Page range to process. Formats: "1-5", "2,4,6", "[1,3,5]". Defaults to all pages.
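The three accepted page-range formats can be normalized with a short parser. This is a client-side sketch for validating input before submission: `parse_page_range` is a hypothetical helper, and treating "1-5" as an inclusive range is an assumption.

```python
import json
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Expand "1-5", "2,4,6", or "[1,3,5]" into a sorted list of page numbers."""
    spec = spec.strip()
    if spec.startswith("["):
        # JSON array form, e.g. "[1,3,5]".
        return sorted({int(page) for page in json.loads(spec)})
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(piece) for piece in part.split("-", 1))
            pages.update(range(start, end + 1))  # assumed inclusive
        elif part:
            pages.add(int(part))
    return sorted(pages)


first_five = parse_page_range("1-5")
```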
Severity threshold at which a detected PII finding blocks the task.
Ignored when detect_pii is false. Findings strictly below the threshold
are allowed through; findings at or above it reject the task.
- any (default): block on any detection, regardless of severity.
- low: block on low, medium, or high severity findings.
- medium: block on medium or high severity findings.
- high: block only on high severity findings.
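The threshold rule (findings at or above the threshold reject the task, findings strictly below pass) can be expressed compactly. The function below is an illustrative model of that rule rather than the service's code; `pii_blocks_task` and the list-of-severities input are assumptions.

```python
from typing import Iterable

SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3}


def pii_blocks_task(threshold: str, finding_severities: Iterable[str]) -> bool:
    """Return True if any finding is at or above the blocking threshold."""
    findings = list(finding_severities)
    if threshold == "any":
        return len(findings) > 0  # any detection blocks, whatever its severity
    limit = SEVERITY_RANK[threshold]
    return any(SEVERITY_RANK[severity] >= limit for severity in findings)


blocked = pii_blocks_task("medium", ["low", "high"])
allowed = pii_blocks_task("high", ["low", "medium"])
```

Here `blocked` is true because a high finding meets the medium threshold, while `allowed` is false because nothing reaches high.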
PII detector engine to use when detect_pii is true. Ignored otherwise.
- standard (default): fast pattern-based detector; low latency, well-suited to bulk pre-screening.
- advanced: model-based detector; slower but catches contextual cases that pattern matching misses (e.g. handwritten names, partially redacted IDs, document-style references to a person).
Response shape selector: slim, full, or custom.
- slim: chunk embed + bbox + page_number + segment_id + segment_type + HTML for tables / Markdown for everything else. Drops content, image, ocr, confidence, chart_data, page_height, page_width.
- full: every field returned (equivalent to omitting this param).
- custom: honor output_fields verbatim.
Precedence: when both response_profile and output_fields are
provided, the profile wins (output_fields only matters for custom
or when the profile is omitted).
Applies to inline JSON responses only — GET /parse/{job_id}?output_file=true
returns a presigned URL to the stored full-shape output file.
"slim"
JSON object controlling HTML/Markdown generation strategy and AI model per segment type.
Example: {"Table": {"html": "LLM", "markdown": "LLM", "model_id": "us_table_v2"}}.
Content filter: comma-separated segment types to keep.
Example: "table,picture". Use "all" to include everything. Defaults to "all".
Alias for segment_analysis (Core Parser name). If both are provided, this takes precedence.
Segment type naming convention.
Unsiloed (default) — e.g., PageHeader, ListItem, Picture.
Other — alternative names e.g., Header, List Item, Figure.
Presigned or public URL of the document to fetch and process.
Required if file is not provided.
Use high-resolution images for cropping and post-processing.
Latency penalty: ~2–3 s per page. Defaults to true.
JSON array string of segment types to validate with VLM.
Example: ["Table", "Formula", "Picture"]. Defaults to [].
Legacy: validate table segment classifications using VLM.
Prefer validate_segments: ["Table"] instead. Defaults to false.
Extract and hyperlink bibliography citations in the markdown output. PDFs only.
Defaults to false.
Response
Job created — poll with GET /parse/{job_id} to retrieve results.
Response body for a successful POST /parse call.
ISO 8601 timestamp when the job was created.
Number of pages deducted from your quota for this job.
Name of the uploaded file or "unknown" when a URL was provided.
Job identifier — pass this to GET /parse/{job_id} to poll for results.
Whether table merging is enabled for this job (reflects the submitted merge_tables value).
Human-readable status message with a polling hint.
Remaining page quota after this job was deducted.
Initial job status. Always "Starting" on creation.

