Extracting data from documents

To process documents, you must specify which data points, or fields, you want to extract. If your project includes different document types, like a mix of passports and driver’s licenses, create a class for each document type and specify fields for each class.

You can create up to 250 classes per project, and up to 100 fields per class.
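When planning a schema before building it in the editor, the limits above can be checked with a few lines of Python. This is only a planning sketch: the `{class_name: [field_names]}` mapping and the `check_schema_limits` helper are hypothetical and are not part of the product.

```python
# Limits stated in the text above.
MAX_CLASSES_PER_PROJECT = 250
MAX_FIELDS_PER_CLASS = 100


def check_schema_limits(schema: dict[str, list[str]]) -> list[str]:
    """Return limit violations for a {class_name: [field_names]} mapping."""
    problems = []
    if len(schema) > MAX_CLASSES_PER_PROJECT:
        problems.append(
            f"{len(schema)} classes exceeds the limit of {MAX_CLASSES_PER_PROJECT}"
        )
    for class_name, fields in schema.items():
        if len(fields) > MAX_FIELDS_PER_CLASS:
            problems.append(
                f"class '{class_name}' has {len(fields)} fields; "
                f"the limit is {MAX_FIELDS_PER_CLASS}"
            )
    return problems


# Hypothetical plan: two document types with a few fields each.
planned = {
    "Passport": ["Full name", "Passport number", "Expiration date"],
    "Driver's license": ["Full name", "License number", "State"],
}
print(check_schema_limits(planned))  # prints []
```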

Creating classes

If your project includes different document types, start by creating a class for each document type. You can then specify a different set of fields for each class.

Organization members can import prebuilt classes from a library of established schemas, such as paystubs, invoices, bank statements, and utility bills.

In projects with classification, a default class called other is assigned to documents that can’t be classified. You can’t delete or modify this class.

  1. In the editing panel, click the Create classes icon (a horizontal bookmark or tab with a small plus in the lower right), then select one of these options based on your AI Hub subscription and project requirements:

    • Create classes — Lets you create a custom class without any fields. If you select this option, enter a succinct name for your document type, then click ← to exit the class editing panel.

    • Browse prebuilt classes (Commercial & Enterprise) — Lets you add common document types and their associated fields based on a library of available schemas. If you select this option, choose the prebuilt classes that you want to add and click Add to project.

  2. Use the Create classes icon to add more classes as needed.

  3. When you’re done creating classes, click Classify documents.

    (Commercial & Enterprise) If your project includes multipage files, you’re prompted to optionally enable file splitting. You can split by document, which lets the model determine where document breaks occur, or split each page, which creates a new document at every page break. After your documents are classified, page ranges indicate how files were split.

    Classes are assigned to your documents and documents are grouped by class in the document list. Any documents that can’t be classified are assigned the other class.

  4. Verify classification. If documents weren’t classified as expected, edit classes to improve your results.

    1. In a class that wasn’t identified accurately, click the overflow icon (three horizontal dots), then select Edit class.

    2. Enter a description to help the model more accurately identify documents in the class, then click ← to exit the class editing panel.

      Effective descriptions include unique identifying details about a document class. Use details related to text in the documents, rather than visual elements like color, which the model can’t “see.”

      In projects that don’t use file splitting, you can reference file extensions to help classify documents. For example, the description for an Images class might be Files with image file extensions, like JPEG, PNG, and TIF.

      As a best practice, limit class descriptions to 1,000 characters (4,000 maximum).

    3. Use the overflow icon to edit more classes as needed.

    4. When you’re done editing classes, click Classify documents.

Creating fields

Create fields for each of the data points you want to identify.

  1. In the editing panel, click Add field.

  2. Enter a field name or select a suggested field name, then press Enter.

    Data is extracted based on field name alone and the result is displayed.

  3. Do one of the following, based on whether your result is accurate:

    • Accurate result — Click ← to exit the field editing panel and continue adding fields.

    • Inaccurate result — Edit the field. When you’re done editing, click ← to exit the field editing panel and continue adding fields.

Editing fields

If field name alone doesn’t return the results you expect, you can edit fields to provide more guidance.

Access the field editor for an existing field by hovering over the field and clicking the edit (pencil) icon.

In the field editor, first choose the field type appropriate for the data you want to identify.

  • Text extraction — Used to extract a string of text or numbers, such as address, account balance, or filing status.

  • Table extraction (Commercial & Enterprise) — Used to extract tables. For more details, see Extracting tables.

  • List extraction (Commercial & Enterprise) — Used to extract a list of items, such as deposits on a banking summary, billing codes on a medical claim form, properties on a broker submission, or items on a receipt. If there are additional data points associated with each item that you want to identify, such as price and SKU for receipt items, you can add an attribute for each one, up to 30 attributes. For best results, limit attributes to 10.

  • Document reasoning — Used to generate results that aren’t explicitly found in the document, but can be deduced, summarized, or calculated. Unless you specify otherwise in your prompt, document reasoning fields assume the current date and time.

  • Visual reasoning (Commercial & Enterprise) — Used to analyze visual and stylistic elements, including elements that OCR doesn’t capture, such as images, watermarks, layout, colors, text styling, and handwritten markup. Source highlighting for visual reasoning fields indicates the relevant page, not specific visual elements.

    Visual reasoning fields require the advanced model, which is charged at a higher rate. For details, see the pricing policy.

  • Derived (Commercial & Enterprise) — Used to generate values based on preceding fields in the class. Reference fields by field name: either type the field name or select it from the dropdown. For example, Identify the state in Customer address. If necessary, you can reorder fields to enable referencing.

    When referencing table or list extraction fields, derived fields try to match the input format. For consistent results, especially when combining different field types, consider normalizing tables or lists as text first.

  • Custom function (Commercial & Enterprise) — Used to compute values or import third-party data with a custom Python function. For more details, see Custom function fields.

For extraction and reasoning fields, if necessary, provide a more detailed description or prompt to indicate what information you’re looking for. As a best practice, keep field and attribute names under 48 characters and use a description or prompt for longer content up to 1,000 characters (4,000 maximum).
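The length guidance above can be applied to draft names and descriptions before you enter them in the field editor. The following is a minimal sketch, assuming you keep drafts as plain strings; the `lint_field` helper and its messages are hypothetical, and only the three numeric thresholds come from the guidance above.

```python
NAME_CHARS_RECOMMENDED = 48          # keep names under this length
DESCRIPTION_CHARS_RECOMMENDED = 1000
DESCRIPTION_CHARS_MAX = 4000


def lint_field(name: str, description: str = "") -> list[str]:
    """Return human-readable warnings for one field or attribute draft."""
    warnings = []
    if len(name) >= NAME_CHARS_RECOMMENDED:
        warnings.append(
            f"name is {len(name)} characters; move detail into the description"
        )
    if len(description) > DESCRIPTION_CHARS_MAX:
        warnings.append(
            f"description is {len(description)} characters; "
            f"the maximum is {DESCRIPTION_CHARS_MAX}"
        )
    elif len(description) > DESCRIPTION_CHARS_RECOMMENDED:
        warnings.append(
            f"description is {len(description)} characters; "
            f"{DESCRIPTION_CHARS_RECOMMENDED} or fewer is recommended"
        )
    return warnings


print(lint_field("Filing status"))  # prints []
print(lint_field("The filing status that the filer selected on page one of the form"))
```

The second call returns one warning, which suggests shortening the name to something like Filing status and moving the remaining detail into the description.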

In reasoning fields, you can use the Enhance prompt option to make improvements to your prompt. Prompt enhancement relies on the selected model to optimize your prompt, checking for clarity, concision, and coherence while eliminating contradictions and redundancies.

For most field types, you can change the model using the model selector dropdown. For details about model capabilities, see Choosing a model.

When you’re done editing a field, click Run to see results and further refine your edits if needed.

Extracting objects

You can extract objects such as tables, checkboxes, signatures, and barcodes using specific settings and prompts.

For best results when extracting tables and checkboxes, enable Tables and Checkboxes in digitization settings.

Extracting tables

The method for extracting tables differs for community users and organization members.

At all product tiers, table extraction is subject to these limitations:

  • Multipage tables might not be extracted correctly unless they have consistent headers on all pages.

  • Source highlighting for tables indicates entire tables, not individual rows, columns, or cells.

To see the tables identified in a document, make sure tables are enabled in digitization settings, then select the prediction (lightbulb) icon in the header and enable Show detected objects for tables. Tables in the document are highlighted, and you can use the adjacent table icon to view, copy, or download a table.

Community

To extract tables as a community user, use a document reasoning field and describe the table extraction you want to perform in the prompt.

You can extract multipage tables and perform some table manipulation with document reasoning fields; however, this method requires more trial and error than the commercially supported method.

Here are some examples of prompts for tables in document reasoning fields:

  • Extract transactions as a Markdown table

  • Extract transactions as JSON

  • Extract all tables with columns Date, Description, Debit, Credit

  • Extract transactions and filter for amounts greater than $1,000
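If you ask for JSON, you can post-process the result outside the product. The sketch below assumes the field returns a JSON array of objects keyed by the column names Date, Description, Debit, and Credit, with amounts as formatted strings. That output shape is an assumption and the model's actual output can vary, so inspect a real result first. The `parse_amount` and `large_debits` helpers and the sample data are hypothetical.

```python
import json


def parse_amount(value) -> float:
    """Convert values such as "$1,250.00", "1250", or 1250 to a float.

    Empty or missing values are treated as 0.
    """
    if value is None or value == "":
        return 0.0
    if isinstance(value, (int, float)):
        return float(value)
    return float(str(value).replace("$", "").replace(",", "").strip())


def large_debits(field_result: str, threshold: float = 1000.0) -> list[dict]:
    """Parse the JSON text and keep rows whose Debit exceeds the threshold."""
    rows = json.loads(field_result)
    return [row for row in rows if parse_amount(row.get("Debit")) > threshold]


# Made-up example of what a field result might look like.
sample_result = """[
  {"Date": "2024-04-02", "Description": "Rent", "Debit": "$1,250.00", "Credit": ""},
  {"Date": "2024-04-05", "Description": "Coffee", "Debit": "$4.50", "Credit": ""},
  {"Date": "2024-04-09", "Description": "Payroll", "Debit": "", "Credit": "$3,000.00"}
]"""

for row in large_debits(sample_result):
    print(row["Date"], row["Description"], row["Debit"])
```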

Commercial & Enterprise

To extract tables as an organization member, use a table extraction field with a brief field name and a description. This method extracts tables with a high degree of accuracy and lets you manipulate tables in various ways.

Here are some examples of descriptions for table extraction fields:

  • Extract tables with columns Date, Description, Debit, Credit

  • Extract transactions and filter for amounts greater than $1,000

  • Extract transactions and return results for 01 April through 15 April

  • Extract transactions and sort amounts from smallest to largest

  • Extract transactions and add a column Flagged with values set to Yes if the debit is greater than $70
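To check whether an extracted table matches what a description asked for, it can help to spell out the intended manipulation in plain Python. This sketch does not show how the product performs these operations. It restates the date-range, sort, and Flagged examples above as reference logic on made-up rows. The helper names and sample data are hypothetical, and using No for unflagged rows is an assumption.

```python
from datetime import date

# Made-up rows standing in for an extracted transactions table.
transactions = [
    {"Date": date(2024, 4, 3), "Description": "Groceries", "Debit": 82.10},
    {"Date": date(2024, 4, 12), "Description": "Parking", "Debit": 12.00},
    {"Date": date(2024, 4, 20), "Description": "Utilities", "Debit": 140.75},
]


def in_date_range(rows, start, end):
    """Keep rows dated from start through end, inclusive."""
    return [row for row in rows if start <= row["Date"] <= end]


def sort_by_amount(rows):
    """Sort rows by debit amount, smallest to largest."""
    return sorted(rows, key=lambda row: row["Debit"])


def add_flagged(rows, threshold=70.0):
    """Add a Flagged column: Yes if the debit exceeds the threshold, else No."""
    return [
        {**row, "Flagged": "Yes" if row["Debit"] > threshold else "No"}
        for row in rows
    ]


first_half = in_date_range(transactions, date(2024, 4, 1), date(2024, 4, 15))
for row in add_flagged(sort_by_amount(first_half)):
    print(row["Description"], row["Debit"], row["Flagged"])
```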

Extracting checkboxes

Checkboxes can be extracted with either an extraction or reasoning field.

  • For a group of checkboxes with a label, such as the Filing Status field on a tax form, use the label for your field name.

  • For a standalone checkbox, use a question that indicates whether the checkbox is ticked. For example, Is the filer claiming capital gains or losses?

Extracting signatures

You can extract information about signatures, including whether a document is signed, who the signer was, and the signature date. Extraction of signature images isn’t available.

Signature information can be extracted with either an extraction or reasoning field.

Often, a field name like signatory name, signatory title, or signature date is adequate. If field name alone fails to extract the data you want, edit the field and provide a description or prompt. For example:

  • Extract all signatures

  • Return yes if this document is signed

Extracting barcodes

You can extract information about barcodes, including their presence, quantity, and, if printed, associated numeric values.

Barcode information can be extracted with either an extraction or reasoning field.

Often, a field name like barcode value is adequate. If field name alone fails to extract the data you want, edit the field and provide a description or prompt. For example:

  • Return all barcode values

  • Return yes if this document contains a barcode

Custom function fields

Commercial & Enterprise

The custom function field type lets you use a Python function to compute values or import data to your project schema.

For example, you might use a custom function to calculate total invoice amount using existing subtotal and tax rate fields:

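# Body of the custom function; subtotal and tax_rate are existing fields.
subtotal = float(subtotal)
tax_rate = float(tax_rate) / 100  # convert a percentage such as 8.5 to 0.085
tax_amount = subtotal * tax_rate
total_amount = subtotal + tax_amount

return round(total_amount, 2)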
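To try this calculation outside the product, you can wrap it in an ordinary Python function. The sketch below is a hedged variant, not the product's required format: the `total_amount` name and its signature are hypothetical, and the `to_number` helper is an added assumption that extracted values may arrive as formatted strings such as 1,200.00 or 8.5%.

```python
def to_number(value) -> float:
    """Strip currency symbols, thousands separators, and percent signs."""
    text = str(value).replace("$", "").replace(",", "").replace("%", "").strip()
    return float(text)


def total_amount(subtotal, tax_rate) -> float:
    # Hypothetical wrapper; the product may define the signature differently.
    subtotal = to_number(subtotal)
    tax_rate = to_number(tax_rate) / 100  # tax_rate is a percentage, e.g. 8.5
    return round(subtotal + subtotal * tax_rate, 2)


print(total_amount("1,200.00", "8.5%"))  # prints 1302.0
```

Cleaning the inputs first means `float()` does not raise a ValueError on formatted values. Whether extracted fields actually include such formatting depends on your documents.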

Custom function fields accept these parameters:

| Parameter | Required? | Description |
| --- | --- | --- |
| context | Required | Stores metadata about the document. |
| context['document_text'] | Optional | Retrieves the entire text of the document. |
| context['file_path'] | Optional | Retrieves the path to the uploaded file. |
| keys | Optional | Access custom variables and organization secrets. Use keys['custom']['<key-name>'] for custom keys and keys['secret']['<key-name>'] for secret keys. |
| <additional-field-name> | Optional | When writing custom functions in automation projects, click Add argument to select additional fields in the class to use in the function. |
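As an illustration of how these parameters might be used together, the sketch below reads the document text and file path from context and a custom key from keys. Only the lookups context['document_text'], context['file_path'], and keys['custom']['<key-name>'] come from the table above. The function name, its signature, the key name keyword, the string return value, and the stand-in dictionaries are all hypothetical.

```python
import os


def describe_document(context, keys):
    # Hypothetical function name and signature, for illustration only.
    text = context["document_text"]
    file_path = context["file_path"]
    keyword = keys["custom"]["keyword"]  # "keyword" is a made-up custom key name

    extension = os.path.splitext(file_path)[1].lower()
    mentions = text.lower().count(keyword.lower())
    return f"{extension} file with {mentions} mention(s) of '{keyword}'"


# Stand-in values for local testing; in a project the platform supplies these.
sample_context = {
    "document_text": "Invoice 1001\nInvoice date: 2024-04-02\nTotal due: $1,302.00",
    "file_path": "uploads/invoice-1001.PDF",
}
sample_keys = {"custom": {"keyword": "invoice"}, "secret": {}}

print(describe_document(sample_context, sample_keys))
# prints: .pdf file with 2 mention(s) of 'invoice'
```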

For additional guidance about custom functions, see Writing custom functions.

Reordering fields

To change the order of fields in the field editor, use the up and down arrows that display when you hover over a field.

Reordering fields can be necessary when creating derived fields, which can reference only fields that precede them in the field editor. Additionally, reordering fields can help speed up reviews or support downstream integrations, because fields are displayed in processed results in the same order as in the field editor.

If you have derived fields or custom functions in your project that reference preceding fields, be aware that reordering fields can break the reference.
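The ordering rule can be expressed as a simple check: a reference is valid only if the referenced field appears earlier in the list. The following is a minimal sketch, assuming you track your field order and references in a list of dictionaries. That data model and the `broken_references` helper are hypothetical and are not a product export format.

```python
def broken_references(fields: list[dict]) -> list[str]:
    """Report references to fields that do not precede the referencing field."""
    seen = set()
    problems = []
    for field in fields:
        for ref in field.get("references", []):
            if ref not in seen:
                problems.append(
                    f"'{field['name']}' references '{ref}', "
                    "which does not precede it"
                )
        seen.add(field["name"])
    return problems


# A derived field that identifies the state in Customer address.
ordered = [
    {"name": "Customer address"},
    {"name": "Customer state", "references": ["Customer address"]},
]
print(broken_references(ordered))  # prints []

# Moving the derived field above its source breaks the reference.
reordered = [ordered[1], ordered[0]]
print(broken_references(reordered))
```

The second call reports one problem, which mirrors what happens when a derived field is moved above a field it depends on.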