Schema-Driven Extraction

ADX extracts structured fields from documents using predefined schemas. Every extracted field includes a confidence score and citation back to the source.

Built-in Schemas

Schema	Fields	Use Case
`invoice`	vendor, invoice_number, date, line_items, subtotal, tax, total, currency	Invoice processing
`contract`	parties, effective_date, expiration_date, governing_law, termination_clause	Contract review
`financial_model`	revenue, expenses, net_income, assumptions, growth_rate	Financial analysis
`table`	headers, rows	Generic table extraction

Using a Schema

python

from adx import ADX

dn = ADX()
doc_id = dn.upload("invoice.pdf")

# Extract with a built-in schema
extraction = dn.extract(doc_id, schema="invoice")

# Each field has a value, confidence, and citation
for field in extraction["fields"]:
    print(f"{field['name']}: {field['value']}")
    print(f"  confidence: {field['confidence']}")
    print(f"  citation: page {field['citation']['page']}")

Auto-Detection

If no schema is specified, ADX uses the document type detected during profiling:

python

# Auto-detects schema from document type
extraction = dn.extract(doc_id)

Custom Schemas

Define custom extraction schemas using JSON Schema format:

python

schema = {
    "name": "purchase_order",
    "fields": [
        {"name": "po_number", "type": "string", "required": True},
        {"name": "vendor", "type": "string", "required": True},
        {"name": "total", "type": "number", "required": True},
        {"name": "delivery_date", "type": "string", "required": False}
    ]
}

extraction = dn.extract(doc_id, schema=schema)

How Extraction Works

Label search — finds field labels in the document (e.g., "Invoice Number", "Invoice #", "Inv No.")
Value extraction — reads the adjacent text or cell value
Type coercion — converts to the expected type (string, number, integer)
Citation creation — links each value to its source location
Confidence scoring — based on label match quality and parser confidence

For PDFs, extraction uses text proximity — finding values near label text blocks. For spreadsheets, it uses cell adjacency — finding values in cells next to label cells.

Field-Level Citations

Every extracted field includes a citation:

json

{
  "name": "total",
  "value": 4250.00,
  "confidence": 0.95,
  "citation": {
    "type": "page",
    "page": 3,
    "bbox": [72, 680, 540, 700],
    "text": "Total: $4,250.00"
  }
}

For spreadsheets, citations reference cell addresses:

json

{
  "name": "revenue",
  "value": 1500000,
  "confidence": 0.9,
  "citation": {
    "type": "cell",
    "sheet": "Summary",
    "cell": "B5",
    "text": "1500000"
  }
}

Schema-Driven Extraction ​

Built-in Schemas ​

Using a Schema ​

Auto-Detection ​

Custom Schemas ​

How Extraction Works ​

Field-Level Citations ​

Schema-Driven Extraction

Built-in Schemas

Using a Schema

Auto-Detection

Custom Schemas

How Extraction Works

Field-Level Citations