Extracting data from documents
To process documents, you must specify which data points, or fields, you want to extract. If your project includes different document types, like a mix of passports and driver's licenses, create a class for each document type and specify fields for each class.
You can create up to 250 classes per project, and up to 100 fields per class.
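If you plan a schema outside the product before building it, a quick check against these limits can catch problems early. The following is a minimal sketch, assuming a hypothetical planning structure that maps class names to lists of field names; it isn't part of any product API.

```python
# Limits quoted in the text above.
MAX_CLASSES_PER_PROJECT = 250
MAX_FIELDS_PER_CLASS = 100


def check_schema_limits(schema: dict[str, list[str]]) -> list[str]:
    """Return human-readable limit violations (an empty list if there are none).

    `schema` is a hypothetical planning structure: class name -> field names.
    """
    problems = []
    if len(schema) > MAX_CLASSES_PER_PROJECT:
        problems.append(
            f"{len(schema)} classes exceeds the limit of {MAX_CLASSES_PER_PROJECT}"
        )
    for class_name, fields in schema.items():
        if len(fields) > MAX_FIELDS_PER_CLASS:
            problems.append(
                f"class '{class_name}' has {len(fields)} fields; "
                f"the limit is {MAX_FIELDS_PER_CLASS}"
            )
    return problems


# Example plan with two document types (names are illustrative).
planned = {
    "Passport": ["Full name", "Passport number", "Expiration date"],
    "Driver's license": ["Full name", "License number", "State"],
}
print(check_schema_limits(planned))  # prints []
```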
Creating classes
If your project includes different document types, start by creating a class for each document type. You can then specify a different set of fields for each class.
Organization members can import prebuilt classes from a library of established schemas, such as paystubs, invoices, bank statements, and utility bills.
In projects with classification, a default class called other is assigned to documents that can't be classified. You can't delete or modify this class.
- In the editing panel, click the Create classes icon, then select one of these options based on your AI Hub subscription and project requirements:
  - Create classes: Lets you create a custom class without any fields. If you select this option, enter a succinct name for your document type, then exit the class editing panel.
  - Browse prebuilt classes (Commercial & Enterprise): Lets you add common document types and their associated fields based on a library of available schemas. If you select this option, choose the prebuilt classes that you want to add and click Add to project.
- Use the Create classes icon to add more classes as needed.
- When you're done creating classes, click Classify documents.
  (Commercial & Enterprise) If your project includes multipage files, you're prompted to optionally enable file splitting. You can split by document, which lets the model determine where document breaks occur, or split each page, which creates a new document at every page break. After you classify your documents, page ranges indicate how files are split.
  Classes are assigned to your documents and documents are grouped by class in the document list. Any documents that can't be classified are assigned the other class.
- Verify classification. If documents weren't classified as expected, edit classes to improve your results.
  - In a class that wasn't identified accurately, click the overflow icon, then select Edit class.
  - Enter a description to help the model more accurately identify documents in the class, then exit the class editing panel.
    Effective descriptions include unique identifying details about a document class. Use details related to text in the documents, rather than visual elements like color, which the model can't "see."
    In projects that don't use file splitting, you can reference file extensions to help classify documents. For example, the description for an Images class might be Files with image file extensions, like JPEG, PNG, and TIF.
    As a best practice, limit class descriptions to 1,000 characters (4,000 maximum).
  - Use the overflow icon to edit more classes as needed.
  - When you're done editing classes, click Classify documents.
Creating fields
Create fields for each of the data points you want to identify.
- In the editing panel, click Add field.
- Enter a field name or select a suggested field name, then press Enter.
  Data is extracted based on the field name alone and the result is displayed.
- Do one of the following, based on whether your result is accurate:
  - Accurate result: Exit the field editing panel and continue adding fields.
  - Inaccurate result: Edit the field. When you're done editing, exit the field editing panel and continue adding fields.
Editing fields
If the field name alone doesn't return the results you expect, you can edit fields to provide more guidance.
Access the field editor for an existing field by hovering over the field and clicking the edit icon.
In the field editor, first choose the field type appropriate for the data you want to identify.
- Text extraction: Used to extract a string of text or numbers, such as address, account balance, or filing status.
- Table extraction (Commercial & Enterprise): Used to extract tables. For more details, see Extracting tables.
- List extraction (Commercial & Enterprise): Used to extract a list of items, such as deposits on a banking summary, billing codes on a medical claim form, properties on a broker submission, or items on a receipt. If there are additional data points associated with each item that you want to identify, such as price and SKU for receipt items, you can add an attribute for each data point, up to 30 attributes. For best results, limit attributes to 10.
- Document reasoning: Used to generate results that aren't explicitly found in the document, but can be deduced, summarized, or calculated. Unless you specify otherwise in your prompt, document reasoning fields assume the current date and time.
- Visual reasoning (Commercial & Enterprise): Used to analyze visual and stylistic elements, including elements that OCR doesn't capture, such as images, watermarks, layout, colors, text styling, and handwritten markup. Source highlighting for visual reasoning fields indicates the relevant page, not specific visual elements.
  Visual reasoning fields require the advanced model, which is charged at a higher rate. For details, see the pricing policy.
- Derived (Commercial & Enterprise): Used to generate values based on preceding fields in the class. Reference fields by field name: either type the field name or select it from the dropdown. For example, Identify the state in Customer address. If necessary, you can reorder fields to enable referencing.
  When referencing table or list extraction fields, derived fields try to match the input format. For consistent results, especially when combining different field types, consider normalizing tables or lists as text first.
- Custom function (Commercial & Enterprise): Used to compute values or import third-party data with a custom Python function. For more details, see Custom function fields.
For extraction and reasoning fields, if necessary, provide a more detailed description or prompt to indicate what information you're looking for. As a best practice, keep field and attribute names under 48 characters and use a description or prompt for longer content up to 1,000 characters (4,000 maximum).
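If you draft field names and descriptions in bulk, the length guidance above can be checked before you enter them. This is a minimal sketch; the function name and the warning/error wording are hypothetical, and only the three numeric thresholds come from the text.

```python
# Thresholds quoted in the guidance above.
NAME_RECOMMENDED_BELOW = 48          # keep names under 48 characters
DESCRIPTION_RECOMMENDED_MAX = 1_000  # best-practice length
DESCRIPTION_HARD_MAX = 4_000         # stated maximum


def review_field_text(name: str, description: str = "") -> list[str]:
    """Return notes about a field name and its description or prompt."""
    notes = []
    if len(name) >= NAME_RECOMMENDED_BELOW:
        notes.append(
            "warning: name is 48 characters or longer; "
            "move the detail into the description"
        )
    if len(description) > DESCRIPTION_HARD_MAX:
        notes.append("error: description exceeds 4,000 characters")
    elif len(description) > DESCRIPTION_RECOMMENDED_MAX:
        notes.append("warning: description exceeds the recommended 1,000 characters")
    return notes


print(review_field_text("Invoice total"))  # prints []
```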
In reasoning fields, you can use the Enhance prompt option to make improvements to your prompt. Prompt enhancement relies on the selected model to optimize your prompt, checking for clarity, concision, and coherence while eliminating contradictions and redundancies.
For most field types, you can change the model using the model selector dropdown. For details about model capabilities, see Choosing a model.
When you're done editing a field, click Run to see results and further refine your edits if needed.
Extracting objects
You can extract objects such as tables, checkboxes, signatures, and barcodes using specific settings and prompts.
Extracting tables
The method for extracting tables differs for community users and organization members.
At all product tiers, table extraction is subject to these limitations:
- Multipage tables might not be extracted correctly unless they have consistent headers on all pages.
- Source highlighting for tables indicates entire tables, not individual rows, columns, or cells.
To extract tables as a community user, use a document reasoning field and describe the table extraction you want to perform in the prompt.
You can extract multipage tables and perform some table manipulation with document reasoning fields; however, this method requires more trial and error than the commercially supported method.
Here are some examples of prompts for tables in document reasoning fields:
- Extract transactions as a Markdown table
- Extract transactions as JSON
- Extract all tables with columns Date, Description, Debit, Credit
- Extract transactions and filter for amounts greater than $1,000
To extract tables as an organization member, use a table extraction field with a brief field name and a description. This method extracts tables with a high degree of accuracy and lets you manipulate tables in various ways.
Here are some examples of descriptions for table extraction fields:
- Extract tables with columns Date, Description, Debit, Credit
- Extract transactions and filter for amounts greater than $1,000
- Extract transactions and return results for 01 April through 15 April
- Extract transactions and sort amounts from smallest to largest
- Extract transactions and add a column Flagged with values set to Yes if the debit is greater than $70
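The manipulations these descriptions ask for correspond to ordinary filter, sort, and derive operations. As a rough illustration of the intended results (not of how the model works internally), here is the same logic in plain Python over made-up sample rows. The row structure, the year, and the "No" value for unflagged rows are assumptions.

```python
from datetime import date

# Made-up sample rows, for illustration only.
transactions = [
    {"Date": date(2024, 4, 2), "Description": "Rent", "Debit": 1500.00},
    {"Date": date(2024, 4, 10), "Description": "Groceries", "Debit": 82.15},
    {"Date": date(2024, 4, 20), "Description": "Coffee", "Debit": 4.50},
]

# "filter for amounts greater than $1,000"
large = [t for t in transactions if t["Debit"] > 1000]

# "return results for 01 April through 15 April" (inclusive on both ends)
first_half = [
    t for t in transactions
    if date(2024, 4, 1) <= t["Date"] <= date(2024, 4, 15)
]

# "sort amounts from smallest to largest"
ascending = sorted(transactions, key=lambda t: t["Debit"])

# "add a column Flagged with values set to Yes if the debit is greater than $70"
# The "No" fallback is an assumption; the description only specifies "Yes".
flagged = [
    {**t, "Flagged": "Yes" if t["Debit"] > 70 else "No"}
    for t in transactions
]

for row in flagged:
    print(row["Description"], row["Flagged"])
```

Comparing extracted results against this kind of reference logic on a few documents is one way to verify that a description is being interpreted as intended.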
Extracting checkboxes
Checkboxes can be extracted with either an extraction or reasoning field.
- For a group of checkboxes with a label, such as the Filing Status field on a tax form, use the label for your field name.
- For a standalone checkbox, use a question that indicates whether the checkbox is ticked. For example, Is the filer claiming capital gains or losses?
Extracting signatures
You can extract information about signatures, including whether a document is signed, who the signer was, and the signature date. Extraction of signature images isn't available.
Signature information can be extracted with either an extraction or reasoning field.
Often, a field name like signatory name, signatory title, or signature date is adequate. If the field name alone fails to extract the data you want, edit the field and provide a description or prompt. For example:
- Extract all signatures
- Return yes if this document is signed
Extracting barcodes
You can extract information about barcodes, including their presence, their quantity, and, if printed, their associated numeric values.
Barcode information can be extracted with either an extraction or reasoning field.
Often, a field name like barcode value is adequate. If the field name alone fails to extract the data you want, edit the field and provide a description or prompt. For example:
- Return all barcode values
- Return yes if this document contains a barcode
Custom function fields
(Commercial & Enterprise) The custom function field type lets you use a Python function to compute values or import data to your project schema.
For example, you might use a custom function to calculate total invoice amount using existing subtotal and tax rate fields:
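A minimal sketch of such a calculation follows. The function name, the parameter names, and the assumption that field values arrive as strings (for example "$1,200.00" and "8.5%") are all hypothetical; the signature the product expects isn't described here, so check Writing custom functions before adapting it.

```python
from decimal import Decimal, ROUND_HALF_UP


def total_invoice_amount(subtotal: str, tax_rate: str) -> str:
    """Compute subtotal plus tax from two extracted field values.

    Hypothetical signature: both inputs are assumed to be strings.
    `tax_rate` may be a percentage ("8.5%") or a fraction ("0.085").
    """
    # Strip currency formatting before converting to an exact decimal.
    subtotal_value = Decimal(subtotal.replace("$", "").replace(",", "").strip())

    rate_text = tax_rate.strip()
    if rate_text.endswith("%"):
        rate = Decimal(rate_text[:-1].strip()) / Decimal(100)
    else:
        rate = Decimal(rate_text)

    total = subtotal_value * (Decimal(1) + rate)
    # Round to cents, rounding halves up.
    return str(total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


print(total_invoice_amount("$1,200.00", "8.5%"))  # prints 1302.00
```

Decimal is used instead of float so that currency arithmetic doesn't pick up binary rounding errors.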
Custom function fields accept these parameters:
For additional guidance about custom functions, see Writing custom functions.
Reordering fields
To change the order of fields in the field editor, use the up and down arrows that display when you hover over a field.
Reordering fields can be necessary when creating derived fields, which can reference only fields that precede them in the field editor. Additionally, reordering fields can help speed up reviews or support downstream integrations, because fields are displayed in processed results in the same order as in the field editor.
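The ordering rule for derived fields can be expressed as a simple check: every referenced field must appear earlier in the list than the derived field that uses it. The sketch below assumes a hypothetical representation (an ordered list of field names plus a mapping from each derived field to the fields it references) and isn't tied to any product API.

```python
def find_forward_references(
    field_order: list[str],
    references: dict[str, list[str]],
) -> list[tuple[str, str]]:
    """Return (derived_field, referenced_field) pairs that violate the rule.

    A pair is a violation when the referenced field is missing or doesn't
    precede the derived field in `field_order`.
    """
    position = {name: index for index, name in enumerate(field_order)}
    problems = []
    for derived, referenced_fields in references.items():
        derived_index = position[derived]  # KeyError if not in field_order
        for referenced in referenced_fields:
            if referenced not in position or position[referenced] >= derived_index:
                problems.append((derived, referenced))
    return problems


# "State" is derived from "Customer address" but is listed first.
refs = {"State": ["Customer address"]}
print(find_forward_references(["State", "Customer address"], refs))
# prints [('State', 'Customer address')]

# After moving "Customer address" up, the reference is valid.
print(find_forward_references(["Customer address", "State"], refs))
# prints []
```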