Alpha release. Install with pip install adx. APIs may change before v1.0.

Document Model

ADX's canonical document model is a parser-agnostic representation called DocumentGraph. Every supported file format is normalized into this model.

Schema Version

Current schema version: 0.1.0
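
Because the alpha notice says the model may change before v1.0, a consumer may want to check `schema_version` before reading a graph. The sketch below is one possible check, not part of ADX: the helper names `parse_version` and `is_compatible` are hypothetical, and the policy of requiring matching major and minor numbers is an assumption.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(found: str, expected: str = "0.1.0") -> bool:
    """Assumed policy: accept any patch release with the same major and minor."""
    return parse_version(found)[:2] == parse_version(expected)[:2]


print(is_compatible("0.1.3"))  # True
print(is_compatible("0.2.0"))  # False
```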

Core Types

DocumentGraph

The root container. One per uploaded file.

| Field | Type | Description |
|---|---|---|
| schema_version | str | Model schema version |
| document | Document | File metadata |
| pages | Page[] | Pages (PDF) |
| workbook | Workbook | Workbook (Excel/CSV) |
| sections | Section[] | Document sections |
| citations | Citation[] | All citations |
| extractions | Extraction[] | Extraction results |
| confidence | ConfidenceSummary | Overall confidence |
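
To illustrate how the PDF and spreadsheet branches of a DocumentGraph might be told apart, here is a minimal sketch. It assumes the graph has been serialized to a plain dict keyed by the field names above; that serialization, the sample data, and the helper `describe_graph` are all hypothetical and not an ADX API.

```python
from typing import Any, Dict


def describe_graph(graph: Dict[str, Any]) -> str:
    """Summarize a graph dict using the branch that matches its file type."""
    doc = graph["document"]
    if doc["file_type"] == "pdf":
        count = len(graph.get("pages") or [])
        return f"{doc['file_name']}: {count} page(s)"
    # Assumption: xlsx and csv files populate `workbook` instead of `pages`.
    sheets = (graph.get("workbook") or {}).get("sheets") or []
    return f"{doc['file_name']}: {len(sheets)} sheet(s)"


pdf_graph = {
    "schema_version": "0.1.0",
    "document": {"file_name": "invoice.pdf", "file_type": "pdf"},
    "pages": [{"page_number": 1}, {"page_number": 2}],
    "workbook": None,
}
csv_graph = {
    "schema_version": "0.1.0",
    "document": {"file_name": "ledger.csv", "file_type": "csv"},
    "pages": [],
    "workbook": {"sheets": [{"name": "ledger", "index": 0}]},
}

print(describe_graph(pdf_graph))  # invoice.pdf: 2 page(s)
print(describe_graph(csv_graph))  # ledger.csv: 1 sheet(s)
```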

Document

File-level metadata.

| Field | Type | Description |
|---|---|---|
| id | str | Unique document ID |
| file_name | str | Original filename |
| file_type | FileType | pdf, xlsx, csv |
| file_size | int | Size in bytes |
| document_type | DocumentType | Detected type |
| page_count | int | Number of pages |
| status | ProcessingStatus | pending, processing, ready, error |

Page

A single PDF page.

| Field | Type | Description |
|---|---|---|
| page_number | int | 1-indexed |
| width | float | Page width in points |
| height | float | Page height in points |
| text_blocks | TextBlock[] | Text on this page |
| tables | Table[] | Tables on this page |
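
Page dimensions are given in points, and a PDF point is 1/72 of an inch, so converting to physical units is a single division. The sketch below assumes a page serialized as a dict with the fields above; the helper `page_size_inches` is hypothetical.

```python
from typing import Tuple

POINTS_PER_INCH = 72.0  # one PDF point is 1/72 inch


def page_size_inches(page: dict) -> Tuple[float, float]:
    """Convert a page's width and height from points to inches."""
    return page["width"] / POINTS_PER_INCH, page["height"] / POINTS_PER_INCH


# Hypothetical sample: a US Letter page is 612 x 792 points.
letter = {"page_number": 1, "width": 612.0, "height": 792.0,
          "text_blocks": [], "tables": []}
print(page_size_inches(letter))  # (8.5, 11.0)
```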

TextBlock

A block of text with position and classification.

| Field | Type | Description |
|---|---|---|
| text | str | The text content |
| type | TextBlockType | heading, paragraph, header, footer |
| bbox | BoundingBox | Position on page |
| font_size | float | Font size |
| page | int | Page number |
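
Since each block carries a `type` and a `page`, a simple outline can be built by keeping only the headings. This is a sketch over dict-shaped blocks with hypothetical sample data; the helper `outline` is not part of ADX.

```python
from typing import List, Tuple


def outline(blocks: List[dict]) -> List[Tuple[int, str]]:
    """Return (page, text) pairs for heading blocks, ordered by page."""
    headings = [b for b in blocks if b["type"] == "heading"]
    # sorted() is stable, so headings on the same page keep their input order.
    return [(b["page"], b["text"]) for b in sorted(headings, key=lambda b: b["page"])]


blocks = [
    {"text": "Terms", "type": "heading", "font_size": 14.0, "page": 2},
    {"text": "Page 1 of 2", "type": "footer", "font_size": 8.0, "page": 1},
    {"text": "Summary", "type": "heading", "font_size": 14.0, "page": 1},
    {"text": "Payment is due in 30 days.", "type": "paragraph", "font_size": 10.0, "page": 2},
]
print(outline(blocks))  # [(1, 'Summary'), (2, 'Terms')]
```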

Table

A table with rows and cells.

| Field | Type | Description |
|---|---|---|
| id | str | Unique table ID |
| page | int | Page number |
| row_count | int | Number of rows |
| col_count | int | Number of columns |
| cells | TableCell[] | All cells |
| has_header | bool | Has header row |
| bbox | BoundingBox | Position on page |
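
The flat `cells` list can be turned back into a row-by-column grid using `row_count` and `col_count`. The TableCell fields are not documented on this page, so the sketch below assumes each cell has zero-indexed `row` and `col` integers and a `text` string; those names, and the helper `to_grid`, are hypothetical.

```python
from typing import List


def to_grid(table: dict) -> List[List[str]]:
    """Rebuild a 2D grid of strings from a table's flat cell list."""
    grid = [["" for _ in range(table["col_count"])] for _ in range(table["row_count"])]
    for cell in table["cells"]:
        # Assumed cell fields: zero-indexed `row` and `col`, plus `text`.
        grid[cell["row"]][cell["col"]] = cell["text"]
    return grid


table = {
    "id": "t1", "page": 1, "row_count": 2, "col_count": 2, "has_header": True,
    "cells": [
        {"row": 0, "col": 0, "text": "Item"},
        {"row": 0, "col": 1, "text": "Amount"},
        {"row": 1, "col": 0, "text": "Hosting"},
        {"row": 1, "col": 1, "text": "120.00"},
    ],
}
grid = to_grid(table)
header = grid[0] if table["has_header"] else None
print(header)    # ['Item', 'Amount']
print(grid[1])   # ['Hosting', '120.00']
```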

Workbook

A spreadsheet workbook.

| Field | Type | Description |
|---|---|---|
| sheets | Sheet[] | All sheets |

Sheet

A single spreadsheet sheet.

| Field | Type | Description |
|---|---|---|
| name | str | Sheet name |
| index | int | Sheet index |
| row_count | int | Number of rows |
| col_count | int | Number of columns |
| cells | SpreadsheetCell[] | All cells |
| formulas | Formula[] | All formulas |
| hidden | bool | Is sheet hidden |
| named_ranges | dict | Named range definitions |
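
The `hidden` and `index` fields make it straightforward to list only the sheets a user would see, in workbook order. The sketch below works on a dict-shaped workbook with hypothetical sample data; `visible_sheets` is an illustrative helper, not an ADX function.

```python
from typing import List


def visible_sheets(workbook: dict) -> List[str]:
    """Return the names of non-hidden sheets, ordered by sheet index."""
    shown = [s for s in workbook["sheets"] if not s["hidden"]]
    return [s["name"] for s in sorted(shown, key=lambda s: s["index"])]


workbook = {
    "sheets": [
        {"name": "Assumptions", "index": 1, "hidden": False},
        {"name": "Scratch", "index": 2, "hidden": True},
        {"name": "Summary", "index": 0, "hidden": False},
    ]
}
print(visible_sheets(workbook))  # ['Summary', 'Assumptions']
```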

Citation

Provenance link back to source.

| Field | Type | Description |
|---|---|---|
| type | CitationType | page, table, cell, range, formula |
| page | int | Page number (PDF) |
| bbox | BoundingBox | Bounding box (PDF) |
| sheet | str | Sheet name (Excel) |
| cell | str | Cell address (Excel) |
| range | str | Cell range (Excel) |
| text | str | Source text snippet |
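
A citation can be rendered as a short, human-readable locator by switching on its `type`. Which fields are populated for each type is an assumption here (page-based for `page` and `table`, sheet plus address for `cell`, `formula`, and `range`), and `format_citation` is a hypothetical helper.

```python
def format_citation(citation: dict) -> str:
    """Render a dict-shaped citation as a short locator string."""
    kind = citation["type"]
    if kind == "page":
        return "p. {}".format(citation["page"])
    if kind == "table":
        return "table, p. {}".format(citation["page"])
    if kind in ("cell", "formula"):
        # Assumption: formula citations point at the cell holding the formula.
        return "{}!{}".format(citation["sheet"], citation["cell"])
    if kind == "range":
        return "{}!{}".format(citation["sheet"], citation["range"])
    raise ValueError("unknown citation type: {}".format(kind))


print(format_citation({"type": "page", "page": 3}))                          # p. 3
print(format_citation({"type": "cell", "sheet": "Q1", "cell": "B7"}))        # Q1!B7
print(format_citation({"type": "range", "sheet": "Q1", "range": "A1:B4"}))   # Q1!A1:B4
```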

Enums

FileType

pdf, xlsx, csv

DocumentType

invoice, contract, financial_model, report, spreadsheet, general

ProcessingStatus

pending, processing, ready, error

TextBlockType

heading, paragraph, header, footer

CitationType

page, table, cell, range, formula

ValidationSeverity

error, warning, info
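
For readers who want to validate these values in their own code, the enums can be mirrored with Python's standard `enum` module. This is a sketch of two of them using the values listed above; the class layout and member names are assumptions and may not match how ADX defines them.

```python
from enum import Enum


class FileType(str, Enum):
    PDF = "pdf"
    XLSX = "xlsx"
    CSV = "csv"


class ProcessingStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    READY = "ready"
    ERROR = "error"


# Looking a member up by value raises ValueError for anything not listed.
print(FileType("csv") is FileType.CSV)        # True
print(ProcessingStatus("ready").name)         # READY
```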
