Extracting data from documents
To process documents, you must specify which data points, or fields, you want to extract. If your project includes different document types, like a mix of passports and driver's licenses, create a class for each document type and specify fields for each class.
You can create up to 250 classes per project, and up to 100 fields per class.
The classes and fields in your project form the project schema: the blueprint of information you want to extract from documents.
Autogenerating a project schema
Agent mode: Projects in agent mode can use a model to autogenerate classes and fields based on uploaded documents.
Autogenerated schemas create up to 20 fields using basic field types like text extraction and list extraction. This approach works well for straightforward extraction needs or as a starting point for more complex schemas that you can refine.
Before you begin
You must have uploaded a set of files that represent the types of documents you want to process. Five or so files of each type is a good start.
- In the editing panel, click Autogenerate schema.
  Commercial & Enterprise: If your project includes multipage files, you're prompted to optionally enable splitting files. Configure file splitting based on your production processing requirements. For example, if you intend to process compiled PDFs, enable file splitting even if your project includes only separate files. You can manually enable or disable file splitting in project settings.
  It might take several minutes for the schema to generate.
- You can use the schema as-is, modify it, or click the Undo schema icon to clear the autogenerated schema and manually generate your project schema.
Manually generating a project schema
To manually create your project schema, create classes for each document type in your project, if necessary, then create fields for the data points you want to extract. This approach gives you complete control over your schema.
Manual schema generation works across all processing modes and provides access to all field types.
Before you begin
You must have uploaded a set of files that represent the types of documents you want to process. Five or so files of each type is a good start.
Creating classes
If your project includes different document types, start by creating a class for each document type. You can then specify a different set of fields for each class.
Organization members can import prebuilt classes from a library of established schemas, such as paystubs, invoices, bank statements, and utility bills.
In projects with classification, a default class called other is assigned to documents that can't be classified. You can't delete or modify this class.
- In the editing panel, click the Create classes icon, then select one of these options based on your AI Hub subscription and project requirements:
  - Create classes: Lets you create a custom class without any fields. If you select this option, enter a succinct name for your document type, then click ✓ to exit the class editing panel.
  - Browse prebuilt classes (Commercial & Enterprise): Lets you add common document types and their associated fields from a library of available schemas. If you select this option, choose the prebuilt classes that you want to add and click Add to project.
- Use the Create classes icon to add more classes as needed.
- When you're done creating classes, click Classify documents.
  Commercial & Enterprise: If your project includes multipage files, you're prompted to optionally enable splitting files. You can split by document, which lets the model determine where document breaks occur, or split each page, which creates a new document at every page break. Configure file splitting based on your production processing requirements. For example, if you intend to process compiled PDFs, enable file splitting even if your project includes only separate files. You can manually enable or disable file splitting in project settings.
  Classes are assigned to your documents, and documents are grouped by class in the document list. Any documents that can't be classified are assigned the other class.
- Verify classification. If documents weren't classified as expected, edit classes to improve your results:
  - In a class that wasn't identified accurately, click the overflow icon, then select Edit class.
  - Enter a description to help the model more accurately identify documents in the class, then click ✓ to exit the class editing panel.
    Effective descriptions include unique identifying details about a document class. In projects that don't use file splitting, you can reference file extensions to help classify documents. For example, the description for an Images class might be "Files with image file extensions, like JPEG, PNG, and TIF." As a best practice, limit class descriptions to 1,000 characters (4,000 maximum).
  - Use the overflow icon to edit more classes as needed.
  - When you're done editing classes, click Classify documents.
Visual tutorial: Creating classes
Classification function
Enterprise: Classification functions let you use a Python function to classify documents based on deterministic rules rather than relying on LLM classification.
When a classification function is enabled for a project, it replaces LLM classification entirely. All documents are classified using the custom function logic instead of AI inference.
Classification functions are useful when you need predictable, rule-based classification for documents that follow consistent patterns. Instead of relying on model inference, you can implement custom logic that identifies document types based on specific text markers.
For example, you might use a classification function to classify insurance forms based on specific identifiers:
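The sketch below is illustrative only: the function name, the parameters document_text and class_name_enum, and the form identifiers are assumptions made for this example, not the actual custom function interface.

```python
# Illustrative sketch of a rule-based classification function.
# ASSUMPTIONS: the signature (document_text, class_name_enum) and the enum
# member names are hypothetical, not the actual custom function interface.
def classify_document(document_text: str, class_name_enum):
    """Classify insurance forms by checking for known form identifiers."""
    text = document_text.upper()
    if "ACORD 25" in text:
        return class_name_enum.certificate_of_liability
    if "CMS-1500" in text:
        return class_name_enum.health_claim_form
    # When no identifier matches, fall back to the default class.
    return class_name_enum.other
```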
Classification functions accept a defined set of input parameters; see Writing custom functions for the full signature.
Classification functions must return a value from the class_name_enum that corresponds to an existing class in your project. If the function returns a class name that doesn't exist, the document is assigned to the default other class.
For additional guidance about custom functions, see Writing custom functions.
Splitting function
Enterprise: Splitting functions let you use a Python function to split multipage files into documents using rule-based logic rather than relying on LLM inference.
When a splitting function is enabled for a project, it replaces LLM-based splitting entirely. All files are split using the custom function logic instead of AI inference.
Splitting functions are useful when you need predictable, rule-based splitting for files that follow consistent patterns. Instead of relying on model inference, you can implement custom logic that identifies document boundaries based on specific text markers, structural elements, visual objects, or a combination of criteria.
Splitting functions process the entire file and return a list of dictionaries that specify page ranges and class labels for each document split. Each dictionary must include:
- class_label: The class name for the document split
- page_start: The starting page number (0-indexed)
- page_end: The ending page number (0-indexed, inclusive)
For example, you might use a splitting function to split files based on barcode separator pages:
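The sketch below is illustrative only: the parameter page_texts (one text string per page), the class label, and the barcode marker string are assumptions made for this example, not the actual custom function interface.

```python
import json

# Illustrative sketch of a rule-based splitting function.
# ASSUMPTIONS: page_texts (one OCR text string per page), the class label, and
# the barcode marker string are hypothetical, not the actual interface.
def split_on_barcodes(page_texts, class_label="claim_form"):
    """Start a new document on every page that contains the separator barcode."""
    if not page_texts:
        return json.dumps([])
    splits = []
    start = 0
    for page_number, text in enumerate(page_texts):
        # A page containing the barcode is treated as the first page of a new document.
        if page_number > 0 and "SEPARATOR_BARCODE" in text:
            splits.append({"class_label": class_label, "page_start": start, "page_end": page_number - 1})
            start = page_number
    # Close out the final split so every page in the file is covered.
    splits.append({"class_label": class_label, "page_start": start, "page_end": len(page_texts) - 1})
    return json.dumps(splits)
```

Note that the returned page ranges in this sketch are contiguous and cover every page, matching the constraints described below.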
Splitting functions accept a defined set of input parameters; see Writing custom functions for the full signature.
Splitting functions must return a JSON-serialized list of dictionaries. Each dictionary represents one document split and must include class_label, page_start, and page_end keys. All class_label values must match existing class names in your project. Page ranges must not overlap and must cover all pages in the file.
You can use splitting functions independently or together with classification functions. When used together, the splitting function determines page ranges and assigns initial class labels, then the classification function processes each split document to refine or override the class assignment.
For additional guidance about custom functions, see Writing custom functions.
Creating fields
Create fields for each of the data points you want to identify.
- In the editing panel, click Add field.
- Enter a field name or select a suggested field name, then press Enter.
  Data is extracted based on the field name alone, and the result is displayed.
- Do one of the following, based on whether your result is accurate:
  - Accurate result: Click ✓ to exit the field editing panel and continue adding fields.
  - Inaccurate result: Edit the field. When you're done editing, click ✓ to exit the field editing panel and continue adding fields.

Visual tutorial: Creating fields
Editing fields
If a field doesn't return the results you expect, you have several ways to fine-tune it.
Access the field editor for an existing field by hovering over the field and clicking the edit icon.
In the field editor, first choose the field type appropriate for the data you want to identify.
With a suitable field type selected, provide a more detailed description or prompt, if necessary, describing the information you're looking for. As a best practice, keep field and attribute names under 48 characters, and use a description or prompt for longer content of up to 1,000 characters (4,000 maximum). For best practices, see Writing effective prompts.
Legacy mode
Legacy mode applies only to projects that haven't been updated to agent mode. All new projects and apps are created in agent mode by default.
For most field types, you can change the model using the model selector dropdown.
- Use the standard model for straightforward fields that perform basic text extraction or calculations. The standard model tends to perform best on shorter documents of fewer than 50 pages. Its faster processing is suitable when speed is your priority.
- Use the advanced model for specialized fields that perform multistep reasoning or complex math. The advanced model performs better on longer documents and those with challenging formatting, and it's required for visual reasoning fields. Its more deliberate processing is suitable when accuracy is your priority.
For details about model capabilities, see Choosing a model.
When youβre done editing a field, click Run to see results and further refine your edits if needed.
Visual tutorial: Editing fields
Field types
Choose the field type appropriate for the data you want to identify.
For more guidance, see Choosing field types.
Custom function fields
Commercial & Enterprise: The custom function field type lets you use a Python function to compute values or import data to your project schema.
For example, you might use a custom function to calculate total invoice amount using existing subtotal and tax rate fields:
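The sketch below is illustrative only: the field_values dictionary and the field names subtotal and tax_rate are assumptions made for this example, not the actual custom function interface.

```python
# Illustrative sketch of a custom function field.
# ASSUMPTIONS: the field_values dict and the field names subtotal and tax_rate
# are hypothetical, not the actual custom function interface.
def compute_invoice_total(field_values):
    """Compute total = subtotal * (1 + tax rate) from two existing fields."""
    subtotal = float(field_values.get("subtotal") or 0)
    tax_rate = float(field_values.get("tax_rate") or 0)  # for example, 0.08 for 8%
    return round(subtotal * (1 + tax_rate), 2)
```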
Custom function fields accept a defined set of input parameters; see Writing custom functions for the full signature.
For additional guidance about custom functions, see Writing custom functions.
Viewing results across documents
To quickly scan or compare results, click the Results table icon in the Documents header.
The results table corresponds to the current view in the editing panel, so the results you see change depending on your current task.
Reordering fields
To change the order of fields in the field editor, use the up and down arrows that display when you hover over a field.
Reordering fields can be necessary when creating derived fields, which can reference fields that precede them in the field editor. Additionally, reordering fields can help speed up reviews or support downstream integrations, because fields appear in processed results in the same order as in the field editor.
Hiding fields
Commercial & Enterprise: Hiding intermediate or computational fields can help simplify human review and downstream integration output.
Consider hiding fields that are used exclusively as input for derived fields or custom functions. For example, you might extract individual date components in separate hidden fields, then combine them into a final formatted date field that reviewers and downstream systems actually need.
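As a hedged illustration of that pattern, the sketch below combines three hypothetical hidden fields (year, month, day) into one formatted date field; the field_values dictionary and the field names are assumptions, not the actual custom function interface.

```python
# Illustrative sketch: combine hidden date-component fields into one visible field.
# ASSUMPTIONS: the field_values dict and the field names year, month, and day
# are hypothetical, not the actual custom function interface.
def format_extracted_date(field_values):
    """Return an ISO 8601 date built from three hidden component fields."""
    year = int(field_values.get("year") or 1900)
    month = int(field_values.get("month") or 1)
    day = int(field_values.get("day") or 1)
    return f"{year:04d}-{month:02d}-{day:02d}"
```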
To mark a field as hidden, open the field editor and enable Hide field.
Hidden fields can't have validation rules, because validations on hidden fields could create confusing review scenarios. If you hide a field with an active validation rule, the rule is removed. If you later unhide the same field, any previous validation rules are restored.
Hidden fields use processing resources and count toward field limits, but their visibility varies across different AI Hub interfaces:

