profile_document
File metadata, page count, detected type, confidence scores, and recommended next tools.
pip install adx. APIs may change before v1.0.
Profile. Inspect. Extract. Validate. Cite.
ADX gives your AI agents structured document tools.
Documents are not text blobs. Give your agents document tools.
Upload any document. ADX wraps best-in-class parsers and exposes a canonical, citeable, inspectable document model to your agents.
Agent: reads invoice.pdf as raw text
Agent: tries regex on flattened string
Agent: misses table spanning pages 2-3
Agent: extracts wrong total
Agent: no way to verify the answer
Raw text dumps lose structure, tables, and provenance. The agent hallucinates fields it cannot cite.
from adx import ADX
dn = ADX()
doc_id = dn.upload("invoice.pdf")
profile = dn.profile(doc_id)
extraction = dn.extract(doc_id, schema="invoice")
result = dn.validate(doc_id, extraction.id)
{"field": "total", "value": 4250.00,
"confidence": 0.95,
"citation": {"page": 3, "bbox": [72, 680, 540, 700]}}
Every field is extracted with a citation. Every value is validated.
ADX is not a parser. It is an orchestration layer that:
DocumentGraph from any supported file format
pip install adx
# Start the server
adx serve
# Upload and profile a document
curl -X POST http://localhost:8000/v1/files -F "file=@invoice.pdf"
curl http://localhost:8000/v1/files/{id}/profile
Or use the Python SDK directly:
from adx import ADX
dn = ADX()
doc_id = dn.upload("invoice.pdf")
print(dn.profile(doc_id))
print(dn.structure(doc_id))
extraction = dn.extract(doc_id, schema="invoice")
print(dn.validate(doc_id, extraction.id))
Nine read-only, deterministic tools that return structured JSON:
File metadata, page count, detected type, confidence scores, and recommended next tools.
Headings, sections, table locations, and page-level outline for navigation.
Full-text search with page/cell locations and surrounding context snippets.
Retrieve a specific page's text blocks or a table's rows with bounding boxes.
Navigate spreadsheet sheets and read cell ranges with formulas and computed values.
Search cells by value or pattern. Trace formula dependencies and precedents.
Built-in schemas for invoices, contracts, and financial models. Field-level citations included.
Rule-based checks: required fields, type coercion, arithmetic verification, citation provenance.
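The four rule types named in the last item (required fields, type coercion, arithmetic verification, citation provenance) can be pictured with a small standalone sketch. This is not ADX's validator. The function `check_invoice`, the field names `subtotal` and `tax`, and the sample amounts are hypothetical and chosen only to make the rules concrete.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical field names for this sketch; ADX's invoice schema may differ.
REQUIRED = ("subtotal", "tax", "total")


def check_invoice(fields):
    """Return a list of problems; an empty list means every rule passed."""
    problems = []
    amounts = {}
    for name in REQUIRED:
        field = fields.get(name)
        # Rule 1: required fields must be present.
        if field is None:
            problems.append(f"missing required field: {name}")
            continue
        # Rule 2: type coercion, accepting numbers or strings like "4,000.00".
        try:
            amounts[name] = Decimal(str(field["value"]).replace(",", ""))
        except (InvalidOperation, KeyError):
            problems.append(f"not a number: {name}")
        # Rule 3: citation provenance, each field must point at a page.
        if "page" not in field.get("citation", {}):
            problems.append(f"no citation: {name}")
    # Rule 4: arithmetic verification, only when all amounts parsed.
    if len(amounts) == len(REQUIRED):
        if amounts["subtotal"] + amounts["tax"] != amounts["total"]:
            problems.append("subtotal + tax does not equal total")
    return problems


# Illustrative data, not real extraction output.
fields = {
    "subtotal": {"value": "4,000.00", "citation": {"page": 3}},
    "tax": {"value": 250, "citation": {"page": 3}},
    "total": {"value": 4250.00, "citation": {"page": 3}},
}
print(check_invoice(fields))
```

Using `Decimal` instead of `float` keeps the sum check exact for currency amounts, which matters when the rule is a strict equality.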
Upload, inspect, extract, and validate documents with a single client object.
from adx import ADX
dn = ADX()
doc_id = dn.upload("report.pdf")
profile = dn.profile(doc_id)
tables = dn.structure(doc_id)["tables"]
data = dn.get_table(doc_id, tables[0]["id"])
Full HTTP API for language-agnostic integration. Start with adx serve.
# Upload
curl -X POST localhost:8000/v1/files -F "file=@model.xlsx"
# Extract with schema
curl -X POST localhost:8000/v1/files/{id}/extract \
-H "Content-Type: application/json" \
-d '{"schema": "financial_model"}'
Command-line tools for scripting and CI pipelines.
adx upload invoice.pdf
adx profile <id>
adx extract <id> --schema invoice
adx validate <id> --extraction <eid>
adx export <id> --format markdown

| Format | Parser | Features |
|---|---|---|
| PDF | PyMuPDF | Text blocks, tables, bounding boxes, section detection |
| Excel (.xlsx) | openpyxl | Sheets, formulas, named ranges, hidden cells, merged cells |
| Excel (.xls) | xlrd | Legacy Excel format support |
| DOCX | python-docx | Paragraphs, headings, tables, images, metadata |
| PPTX | python-pptx | Slides, text shapes, tables, layout extraction |
| RTF | striprtf | Text extraction with formatting stripped |
| CSV | stdlib csv | Dialect sniffing, encoding detection, header inference |
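
For the CSV row, dialect sniffing and header inference are both available in Python's standard `csv` module through `csv.Sniffer`. The sketch below shows those two calls on an inline sample. It only illustrates the stdlib facility and makes no claim about how ADX invokes it; encoding detection is not covered, and the sample data is made up.

```python
import csv
import io

# Made-up sample using semicolons instead of commas.
sample = "id;item;amount\n1;Widget;10.50\n2;Gadget;4.25\n"

sniffer = csv.Sniffer()
# Restrict the candidate delimiters so letters are never considered.
dialect = sniffer.sniff(sample, delimiters=";,\t")
# Heuristic: compares the first row's types and lengths with later rows.
has_header = sniffer.has_header(sample)

rows = list(csv.reader(io.StringIO(sample), dialect))
header, body = (rows[0], rows[1:]) if has_header else (None, rows)

print(dialect.delimiter, has_header, header, len(body))
```

Both calls are heuristics, so a caller would normally keep a fallback (for example, a comma-delimited default) for samples the sniffer cannot classify.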