Testing apps

Commercial & Enterprise

Accuracy testing compares the results of an app run to verified values for a set of documents. By comparing actual results to verified values, you can see how accurate your app is and where you might want to make improvements.

Accuracy testing overview

Accuracy testing can help you refine apps to ensure they meet satisfactory thresholds for validity and accuracy.

Follow these high-level steps to implement accuracy testing.

  1. Develop or iterate on an app in Build and create a new app version with the Production release state.

    Your new app version is stored in the Hub.

    Share the app to enable other organization members to access app versions with the Production release state.

  2. Create ground truth datasets or update existing datasets associated with your app.

    Verify ground truth values for any new or updated datasets.

  3. Conduct accuracy testing on the new app version.

    Review accuracy metrics and examine error patterns to identify areas for improvement.

  4. Repeat the previous steps as needed, using accuracy test results to guide incremental improvements to your app.

    When you’re satisfied with accuracy test results, your app is ready for production use.

Return to this process whenever your document processing needs change. Regularly tracking accuracy metrics over time helps ensure your app continually meets accuracy thresholds.

Managing ground truth datasets

Ground truth datasets are sets of files and associated ground truth values that you use to test app accuracy.

Ground truth datasets are associated with a specific app. For a 1:1 comparison, you use the same set of documents to create the ground truth dataset and to test the app. You can create multiple ground truth datasets for an app to test different batches of input documents.

There are several tasks associated with managing ground truth datasets:

Creating ground truth datasets

Create a ground truth dataset to establish a new set of files to use for accuracy testing.

There are two methods for creating ground truth datasets: using project files or uploading new documents. As a best practice, use project files for initial testing and iteration, then test with a separate dataset of new documents. This approach first verifies your production app’s consistency with your Build project, then assesses your app’s generalizability to unseen data.

  1. From the Hub, open the app you want to test.

  2. In the app sidebar, select Accuracy tests.

  3. Click Manage datasets, then click Create dataset.

  4. Specify details about your ground truth dataset, then click Next.

    • Name – If you’re uploading files, specify a unique name for the ground truth dataset. If you’re using project files, the dataset name is fixed as Project files. You can have only one project files dataset per app.

    • Workspace – Select the workspace where users can run and review accuracy tests for this dataset.

    • File source – Select whether to use project files from Build, or new files from a connected or local drive.

  5. If you’re uploading files, select files to test with. If you’re using project files, all files are automatically included in the dataset; you can’t modify files in this step. Click Next.

    The files you selected are run through your app and results are generated.

What's next
Click Close to return to the Ground truth datasets page. When the app run completes, verify ground truth values in human review. Verified results become the ground truth values for the dataset.

Updating ground truth datasets

Any time you edit an app, you must update the ground truth datasets associated with the app. For example, if you add fields, modify prompts, or tune validation rules, the ground truth values in your datasets must be reverified.

When you update an outdated dataset, you must verify results for classes, as well as for any new files and fields. Existing ground truth values for fields are preserved.

  1. From the Hub, open the app that you modified.

  2. In the app sidebar, select Accuracy tests.

  3. Click Manage datasets.

  4. From the Ground truth datasets page, hover over a dataset with the Outdated status and click the update dataset icon (a circle with an exclamation point centered within it).

  5. Click Update dataset to run the app against the dataset files and generate new results.

What's next
When the app run completes, verify ground truth values in human review. Verified results become the ground truth values for the dataset.

Verifying ground truth values

After running files in a dataset through your app, verify results in human review. Verified results become the ground truth values for the dataset.

Ground truth datasets have statuses that indicate readiness for accuracy testing. The Review required status indicates that ground truth values haven’t yet been established for the dataset.

  1. From the Ground truth datasets page, hover over a dataset with the Review required status and click the edit icon (a pencil).

    All files in the dataset open in human review so you can verify results.

  2. For each file you’re reviewing, verify and correct data if needed, then mark the file as reviewed.

    • To correct mapping data where multipage files were incorrectly parsed into individual documents, select the documents grid. Select pages and use the button controls to move or delete pages or create additional documents.

    • To correct classification data, use one of these methods:

      • In the fields list, click the Edit classification icon near the assigned class. Select the correct class, then click Confirm.

      • In the documents grid, click the class name. Select the correct class, then click Confirm.

        When you change document classification, you can specify how the schema for the new class is applied. By default, the app reprocesses the document to identify field results for the new class. Reprocessing incurs usage charges at the same rates as regular app runs. To apply the schema for the new class without reprocessing, deselect Extract fields for the updated class.

    • To correct text fields, in the fields list, select a field. Enter a new value or, in the document viewer, select the area of the document that contains the information for that field. You can click to select text, or use your mouse to draw a box around the information.

      If validations apply to the selected document, all fields are automatically revalidated when you modify a value.

    • To correct tables, in the fields list, select a table to open it for editing. Select any table cell, then in the document viewer, select the area of the document that contains the information for that cell. You can click to select text, or use your mouse to draw a box around the information.

  3. Review any additional files in the file list.

  4. Click Finish review.

    This action marks everything in the dataset as reviewed and advances you to the dataset overview where you can run an accuracy test.

Modifying ground truth values

You can view ground truth values by opening a dataset and selecting the Ground truth values tab.

To see the values full-screen, hover over any cell and click the maximize icon (four arrows pointing toward the corners of a box). The full-screen comparison is helpful to review tables and longer-form results.

If you need to modify any values, you can open the dataset in human review by selecting More actions > Modify ground truth values. Follow the steps for verifying ground truth values to make any necessary changes.

Configuring dataset parameters

Tune how ground truth values are compared to run results during accuracy testing by configuring parameters for each ground truth dataset.

  1. From the Ground truth datasets page, select a dataset to open its overview page.

  2. Select the Configuration tab, modify parameters as needed, then click Save.

    • Exempt misclassified documents – Excludes incorrectly classified documents from extraction accuracy calculations. If enabled, this option prevents misclassified documents from skewing field-level accuracy calculations.

    • Ignore whitespace – Disregards spaces, tabs, and line breaks when comparing values.

    • Ignore special characters – Excludes non-alphanumeric characters from comparisons.

    • Ignore casing – Treats uppercase and lowercase letters as equivalent.

    • Allow Levenshtein distance – Permits a maximum number of character differences between strings. For example, a Maximum difference of 1 treats cat and bat as a match because they differ by 1 character.
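
To make these options concrete, here's a rough sketch in Python of how a comparison along these lines could work. It illustrates the general idea only, not the app's actual comparison logic, and the function and parameter names are hypothetical.

```python
# Illustrative sketch only: shows how normalization options and a Levenshtein
# threshold could combine when comparing a run result to a ground truth value.
# Function and parameter names are hypothetical, not the app's API.
import re

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]

def values_match(result: str, truth: str, *, ignore_whitespace=False,
                 ignore_special=False, ignore_casing=False,
                 max_levenshtein=0) -> bool:
    def normalize(s: str) -> str:
        if ignore_whitespace:
            s = re.sub(r"\s+", "", s)             # drop spaces, tabs, line breaks
        if ignore_special:
            s = re.sub(r"[^0-9A-Za-z\s]", "", s)  # drop non-alphanumeric characters
        if ignore_casing:
            s = s.lower()
        return s
    return levenshtein(normalize(result), normalize(truth)) <= max_levenshtein

# With a maximum difference of 1, "cat" and "bat" are treated as a match.
print(values_match("cat", "bat", max_levenshtein=1))  # True
```

In practice you set these options on the dataset's Configuration tab; the sketch only shows how they interact conceptually.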

Testing accuracy

Before you begin
You must have one or more ground truth datasets for the app you want to test.
  1. From the Hub, open the app you want to test.

  2. In the app sidebar, select Accuracy tests.

  3. Click Test accuracy.

  4. Select the app version and ground truth dataset you want to use for testing, then click Next.

  5. In the Test accuracy window, click Run test.

    The files in the selected ground truth dataset are run through the app version you selected and compared to the dataset’s ground truth values to generate accuracy metrics.

Viewing accuracy test results

The Accuracy tests page for each app, accessible from the app sidebar, displays a list of accuracy tests run against all app versions. The page summarizes key accuracy metrics for each test.

To see a full accuracy report, hover over a completed accuracy test and click the view accuracy report icon (a box with an arrow pointing out of the upper-right corner).

Accuracy reports reiterate the key metrics shown on the Accuracy tests page. These Project overview metrics provide a high-level summary of classification and extraction validity and accuracy. Similarly, the Summary by class section reports classification and extraction validity and accuracy by class.

The Extraction details section breaks down accuracy metrics for individual fields within each class. Use this section to gain a better understanding of project and class summary metrics. You can expand the Ground truth table to drill into results for each field. The ground truth table reports classification and extraction results in each table cell. Validation results are indicated with icons and ground truth values are displayed when they differ from the run results.

One row of the ground truth table, showing a variety of results.

  1. Result failed validation, indicated by a red error icon.

  2. Result passed validation, indicated by a green checkmark.

  3. Result didn’t match ground truth value, indicated in red strikethrough text with the ground truth value reported below.

  4. Result matched ground truth value, indicated in normal text.

To see a full-screen comparison of the ground truth value and the run result, hover over the result in any cell and click the view icon (four arrows pointing toward the corners of a box). This full-screen comparison is helpful to review tables and longer-form results. From this view, you can also set the run result as the new ground truth value if needed.

Accuracy metrics

Accuracy metrics measure how valid and how accurate app results are.

In projects without classification, classification metrics are reported at 100% because all documents are effectively assigned to the default class.

Automation rate indicates the percent of classes or fields that are processed without human intervention, either because they pass validation or because they have no validation rules. While higher automation rates generally indicate efficiency, they can be misleading in apps with few or no validations.

Validated accuracy indicates the percent of automated classes or fields that match the ground truth dataset. A higher validated accuracy means that results are both valid, as measured by validation rules, and accurate, as measured by ground truth values.

Raw accuracy indicates the percent of fields that match corresponding ground truth values. This metric measures extraction accuracy only, without factoring in validations. Raw accuracy is particularly helpful when it differs significantly in either direction from automation rate. For example, if automation rate is low but raw accuracy is high, this suggests accurate run results but overly strict validation rules.
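
To see how these metrics relate, here is a small worked example with hypothetical counts. The formulas follow from the definitions above (automation rate and raw accuracy are computed over all fields, validated accuracy over automated fields only); the numbers are illustrative, not from a real test.

```python
# Hypothetical worked example of the three metrics, using the definitions above.
# Counts are field-level and purely illustrative.
total_fields = 100           # all fields extracted in the test run
automated = 80               # passed validation or had no validation rules
automated_and_correct = 72   # automated fields that also match ground truth
correct = 90                 # all fields that match ground truth

automation_rate = automated / total_fields               # 80%
validated_accuracy = automated_and_correct / automated   # 90%
raw_accuracy = correct / total_fields                    # 90%

print(f"Automation rate:    {automation_rate:.0%}")      # 80%
print(f"Validated accuracy: {validated_accuracy:.0%}")   # 90%
print(f"Raw accuracy:       {raw_accuracy:.0%}")         # 90%
```

In this example, raw accuracy is higher than the automation rate, which is the pattern described above that can indicate accurate extractions held back by strict validation rules.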
