Extracting data from documents

To process documents, you must specify which data points, or fields, you want to extract. If your project includes different document types, like a mix of passports and driver’s licenses, create a class for each document type and specify fields for each class.

You can create up to 250 classes per project, and up to 100 fields per class.
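
If you draft a schema on paper or in a script before building it in the editing panel, a quick check against these limits can catch problems early. The following is a minimal sketch, not an AI Hub API: the dictionary layout and the `check_schema_limits` name are hypothetical, and only the two limits come from this page.

```python
# Hypothetical helper for drafting a schema; AI Hub does not provide this function.
MAX_CLASSES_PER_PROJECT = 250  # limit stated on this page
MAX_FIELDS_PER_CLASS = 100     # limit stated on this page


def check_schema_limits(schema):
    """Return a list of problems for a {class_name: [field_names]} mapping."""
    problems = []
    if len(schema) > MAX_CLASSES_PER_PROJECT:
        problems.append(
            f"{len(schema)} classes exceeds the limit of {MAX_CLASSES_PER_PROJECT}"
        )
    for class_name, fields in schema.items():
        if len(fields) > MAX_FIELDS_PER_CLASS:
            problems.append(
                f"class '{class_name}' has {len(fields)} fields, "
                f"over the limit of {MAX_FIELDS_PER_CLASS}"
            )
    return problems


# Illustrative draft with one class per document type.
draft = {
    "passport": ["full_name", "passport_number", "expiration_date"],
    "drivers_license": ["full_name", "license_number", "date_of_birth"],
}
print(check_schema_limits(draft))  # an empty list means the draft fits the limits
```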

Creating classes

If your project includes different document types, start by creating a class for each document type. You can then specify a different set of fields for each class.

Organization members can import prebuilt classes from a library of established schemas, such as paystubs, invoices, bank statements, and utility bills.

In projects with classification, a default class called other is assigned to documents that can’t be classified. You can’t delete or modify this class.

  1. In the editing panel, click the Create classes icon (a horizontal bookmark or tab with a small plus in the lower right), then select one of these options based on your AI Hub subscription and project requirements:

    • Create classes — Lets you create a custom class without any fields. If you select this option, enter a succinct name for your document type, then click ← to exit the class editing panel.

    • Browse prebuilt classes (Commercial & Enterprise) — Lets you add common document types and their associated fields based on a library of available schemas. If you select this option, choose the prebuilt classes that you want to add and click Add to project.

  2. Use the Create classes icon to add more classes as needed.

  3. When you’re done creating classes, click Classify documents.

    Commercial & Enterprise: If your project includes multipage files, you’re prompted to optionally enable file splitting. You can split by document, which lets the model determine where document breaks occur, or split each page, which creates a new document at every page break. After your documents are classified, page ranges indicate how files are split.

    Classes are assigned to your documents and documents are grouped by class in the document list. Any documents that can’t be classified are assigned the other class.

  4. Verify classification. If documents weren’t classified as expected, edit classes to improve your results.

    1. In a class that wasn’t identified accurately, click the overflow icon (an ellipsis of three horizontal dots), then select Edit class.

    2. Enter a description to help the model more accurately identify documents in the class, then click ← to exit the class editing panel.

      Effective descriptions include unique identifying details about a document class. Use details related to text in the documents, rather than visual elements like color, which the model can’t “see.”

      In projects that don’t use file splitting, you can reference file extensions to help classify documents. For example, the description for an Images class might be “Files with image file extensions, like JPEG, PNG, and TIF.”

      As a best practice, limit class descriptions to 1,000 characters (4,000 maximum).

    3. Use the overflow icon to edit more classes as needed.

    4. When you’re done editing classes, click Classify documents.

Creating fields

Create fields for each of the data points you want to identify.

  1. In the editing panel, click Add field.

  2. Enter a field name or select a suggested field name, then press Enter.

    Data is extracted based on field name alone and the result is displayed.

  3. Do one of the following, based on whether your result is accurate:

    • Accurate result — Click ← to exit the field editing panel and continue adding fields.

    • Inaccurate result — Edit the field. When you’re done editing, click ← to exit the field editing panel and continue adding fields.

Editing fields

If field name alone doesn’t return the results you expect, you can edit fields to provide more guidance.

Access the field editor for an existing field by hovering over the field and clicking the edit icon (a pencil).

In the field editor, first choose the field type appropriate for the data you want to identify.

With a suitable field type selected, provide a more detailed description or prompt if the field needs one, describing the information you’re looking for. As a best practice, keep field and attribute names under 48 characters, and put longer content in the description or prompt, ideally within 1,000 characters (4,000 maximum). For more guidance, see Writing effective prompts.
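
The length guidance above can be checked before you paste names and prompts into the field editor. This is a rough sketch under stated assumptions: `review_field` is a hypothetical helper rather than part of AI Hub, and it assumes that a “character” corresponds to one Python string character.

```python
# Hypothetical pre-check; the thresholds are the ones given on this page.
MAX_NAME_LENGTH = 48              # names should stay under this length
RECOMMENDED_PROMPT_LENGTH = 1000  # best-practice prompt length
MAX_PROMPT_LENGTH = 4000          # hard maximum for a prompt


def review_field(name, prompt=""):
    """Return a list of warnings for a draft field name and prompt."""
    warnings = []
    if len(name) >= MAX_NAME_LENGTH:
        warnings.append("name should be under 48 characters; move detail to the prompt")
    if len(prompt) > MAX_PROMPT_LENGTH:
        warnings.append("prompt exceeds the 4,000-character maximum")
    elif len(prompt) > RECOMMENDED_PROMPT_LENGTH:
        warnings.append("prompt is longer than the recommended 1,000 characters")
    return warnings


print(review_field("account_balance", "Closing balance on the final statement page."))
```

The same thresholds of 1,000 and 4,000 characters apply to class descriptions, so the prompt check can be reused for them.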

For most field types, you can change the model using the model selector dropdown.

  • Use the standard model for straightforward fields that perform basic text extraction or calculations. The standard model tends to perform best on shorter documents of fewer than 50 pages. Its faster processing is suitable when speed is your priority.

  • Use the advanced model for specialized fields that perform multistep reasoning or complex math. The advanced model performs better on longer documents and those with challenging formatting, and it’s required for visual reasoning fields. Its more deliberate processing is suitable when accuracy is your priority.

For details about model capabilities, see Choosing a model.
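
The rules of thumb above can be written down as a small decision function. This is only an illustration of the guidance on this page: `suggest_model` and its arguments are hypothetical, the model is actually chosen in the model selector dropdown, and factors such as challenging formatting are not modeled here.

```python
def suggest_model(field_type, page_count, needs_multistep_reasoning=False):
    """Suggest 'standard' or 'advanced' using the rules of thumb on this page."""
    if field_type == "visual reasoning":
        return "advanced"  # the advanced model is required for visual reasoning
    if needs_multistep_reasoning:
        return "advanced"  # multistep reasoning or complex math
    if page_count >= 50:
        return "advanced"  # standard tends to perform best under 50 pages
    return "standard"      # faster processing for straightforward fields


print(suggest_model("text extraction", page_count=12))
print(suggest_model("visual reasoning", page_count=3))
```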

When you’re done editing a field, click Run to see results and further refine your edits if needed.

Field types

Choose the field type appropriate for the data you want to identify.

| Field type | Commercial+ | Used to… |
| --- | --- | --- |
| Text extraction | | Extract strings of text or numbers, such as address, account balance, or filing status. |
| Table extraction | ✓ | Extract structured tabular data from documents. |
| List extraction | ✓ | Extract multiple similar items with optional attributes, such as transactions or line items. |
| Document reasoning | | Generate results that aren’t explicitly stated, through deduction, summarization, or calculation. |
| Visual reasoning | ✓ | Analyze visual and stylistic elements including images, watermarks, layout, and formatting. Requires the advanced model. |
| Derived | ✓ | Generate values based on other fields in the class. |
| Custom function | ✓ | Compute values or import external data using Python functions. |

For more guidance, see Choosing field types.

Custom function fields

Commercial & Enterprise

The custom function field type lets you use a Python function to compute values or import data to your project schema.

For example, you might use a custom function to calculate total invoice amount using existing subtotal and tax rate fields:

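    # Convert the extracted field values to numbers.
    subtotal = float(subtotal)
    tax_rate = float(tax_rate) / 100  # tax_rate is a percentage
    tax_amount = subtotal * tax_rate
    total_amount = subtotal + tax_amount

    # Return the total rounded to two decimal places.
    return round(total_amount, 2)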
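
To try the same calculation outside AI Hub, you can wrap it in an ordinary Python function. This sketch makes two assumptions that are not from this page: the function name `total_invoice_amount` is illustrative, and the extracted values are assumed to arrive as text that may contain currency symbols, thousands separators, or a percent sign.

```python
def total_invoice_amount(subtotal, tax_rate):
    """Return subtotal plus tax, where tax_rate is a percentage such as '8.5'."""
    # Assumption: extracted values may include '$', ',' or '%' characters.
    subtotal_value = float(str(subtotal).replace("$", "").replace(",", ""))
    rate_value = float(str(tax_rate).replace("%", "")) / 100
    return round(subtotal_value + subtotal_value * rate_value, 2)


print(total_invoice_amount("1,200.00", "8.5%"))  # 1302.0
```

If a value can’t be converted to a number, `float` raises a `ValueError`, which makes extraction problems visible instead of silently producing a wrong total.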

Custom function fields accept these parameters:

| Parameter | Required? | Description |
| --- | --- | --- |
| context | Required | Stores metadata about the document. |
| context['document_text'] | Optional | Retrieves the entire text of the document. |
| context['file_path'] | Optional | Retrieves the path to the uploaded file. |
| keys | Optional | Accesses custom variables and organization secrets. Use keys['custom']['<key-name>'] for custom keys and keys['secret']['<key-name>'] for secret keys. |
| <additional-field-name> | Optional | When writing custom functions in automation projects, click Add argument to select additional fields in the class to use in the function. |

For additional guidance about custom functions, see Writing custom functions.
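
As a rough illustration of how the context and keys parameters might be used together, the sketch below counts mentions of a configurable keyword in the document text. The function name, its signature, and the custom key named keyword are all assumptions made for this example, and the sample dictionaries are local stand-ins for the values AI Hub supplies at run time.

```python
import re


def count_keyword_mentions(context, keys=None):
    """Count how often a configured keyword appears in the document text."""
    # Assumption: a custom key named 'keyword' exists; the name is illustrative.
    keyword = (keys or {}).get("custom", {}).get("keyword", "invoice")
    text = context.get("document_text", "")
    # re.escape treats the keyword as literal text, not as a pattern.
    return len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE))


# Local stand-ins for the values supplied at run time.
sample_context = {"document_text": "Invoice 1042. Please pay this invoice by May 1."}
sample_keys = {"custom": {"keyword": "invoice"}, "secret": {}}
print(count_keyword_mentions(sample_context, sample_keys))  # 2
```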

Viewing results across documents

To quickly scan or compare results, click the Results table icon (a 2x6 table) in the Documents header.

The results table corresponds to the current view in the editing panel, so the results you see change depending on your current task.

| If the editing panel shows… | Then the results table displays… |
| --- | --- |
| Classes | Final results for all fields, across all classes. |
| Field editor | Final result and, if applicable, confidence threshold validation result for the selected field. |
| Validations with no rule selected | Validation results for all fields, across all classes. |
| Validations with a rule selected, or Validation editor | Validation result for the selected rule and the result of any fields used to calculate it. |

Reordering fields

To change the order of fields in the field editor, use the up and down arrows that display when you hover over a field.

Reordering fields can be necessary when creating derived fields, which can reference fields that precede them in the field editor. Additionally, reordering fields can be helpful to speed up reviews or support downstream integrations, because fields are displayed in processed results in the same order as in the field editor.

If you have derived fields or custom functions in your project that reference preceding fields, be aware that reordering fields can break the reference.
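
Before reordering, it can help to check which references would break. The following sketch assumes you keep your own ordered list of field names and the fields each one references; `find_broken_references` and the data layout are hypothetical and not something AI Hub exports.

```python
def find_broken_references(fields):
    """Return (field, reference) pairs where the reference does not come earlier.

    `fields` is an ordered list of (name, referenced_names) tuples.
    """
    seen = set()
    broken = []
    for name, references in fields:
        for reference in references:
            if reference not in seen:
                broken.append((name, reference))
        seen.add(name)
    return broken


ordered = [("subtotal", []), ("tax_rate", []), ("total", ["subtotal", "tax_rate"])]
reordered = [("total", ["subtotal", "tax_rate"]), ("subtotal", []), ("tax_rate", [])]
print(find_broken_references(ordered))    # []
print(find_broken_references(reordered))  # [('total', 'subtotal'), ('total', 'tax_rate')]
```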

Hiding fields

Commercial & Enterprise

Hiding intermediate or computational fields can help simplify human review and downstream integration output.

Consider hiding fields that are used exclusively as input for derived fields or custom functions. For example, you might extract individual date components in separate hidden fields, then combine them into a final formatted date field that reviewers and downstream systems actually need.
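
The date example might look like the sketch below, where three hidden fields feed one visible field. The function name and the assumption that each component is extracted as text are illustrative. The ISO 8601 output format is one possible choice, not a requirement of AI Hub.

```python
from datetime import date


def combine_date_parts(year, month, day):
    """Combine separately extracted date parts into an ISO 8601 date string."""
    # Assumption: each part is extracted as text, such as '2024', '3', '9'.
    return date(int(year), int(month), int(day)).isoformat()


print(combine_date_parts("2024", "3", "9"))  # 2024-03-09
```

Building a `date` object first means an impossible combination, such as month 13, raises a `ValueError` rather than producing a malformed string.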

To mark a field as hidden, open the field editor and enable Hide field.

Hidden fields can’t have validation rules, because validations on hidden fields could create confusing review scenarios. If you hide a field with an active validation rule, the rule is removed. If you later unhide the same field, any previous validation rules are restored.

Hidden fields use processing resources and count toward field limits, but their visibility varies across different AI Hub interfaces:

| Interface | Hidden field behavior |
| --- | --- |
| App run results (UI) | Hidden by default, can be unhidden with human review field filters |
| App run results (exported) | Unhidden |
| Accuracy tests | Hidden by default, can be unhidden via test configuration |
| Human review | Hidden by default, can be unhidden with human review field filters |
| Deployment run results (exported) | Unhidden |
| Downstream integrations | Hidden by default, can be unhidden via deployment configuration |
| API & SDK results | Hidden by default, can be unhidden via deployment configuration |
| Deployment metrics | Hidden by default, can be unhidden via deployment configuration |