Simulated classification

When building an app, the easiest way to classify documents is to define classes and let the app assign documents to classes based on class names and document contents.

For many documents—especially ones that are short or highly structured—this approach works fine.

For example, 1040 tax forms and 1095-C tax forms are short, highly structured, and quite different from each other. Defining classes called 1040 and 1095-C is probably all you need to classify those documents reliably.

But this traditional, class-based approach doesn’t always work reliably with long or unstructured documents. In that case, you can try the alternative technique described in this tutorial: simulated classification.

Configuring simulated classification

You can simulate classification by defining a special field in a class schema and treating the value extracted for that field as the document’s class.

For example, in an app that processes recipes, instead of defining separate soup and salad classes, you could define a food type field in the Default class that returns either soup or salad.

This approach is called simulated classification because all the documents still technically belong to the default class. As far as the app is concerned, those recipes aren’t classified at all! But whatever workflow the app feeds into can process or classify the recipes according to the value in each document’s food type field.

To use a different example, imagine an app that classifies novels based on where their action takes place: either Russia or the United States. You define Russia and United States classes, but discover that the app’s classification accuracy isn’t very high. Adding text to the Description field of each class helps, but not enough. What else can you try?

How about using a field to simulate classification?

First, remove the existing classification scheme by deleting the Russia and United States classes. That leaves only the Default class.

Next, define a field called Location in the Default class. To limit the field’s values to either Russia or United States, add a field description: What country is this novel set in? Return only "Russia" or "United States".

To test your simulated classification, hit Run all, click the grid icon, and check that the field extracts the right value, in the right format, for all input documents. Success!

This field description gives the desired values for the Location field, but it took many rounds of experimentation and prompt editing to get the field working correctly. Don’t be discouraged if your first few attempts at defining a field don’t work out quite right.
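If you export the app’s results, you can script the same check instead of scanning the grid by eye. The following is a minimal sketch, not part of the app itself. It assumes the results are available as a list of Python dictionaries with hypothetical document and Location keys; the file names and values are made up for illustration.

```python
# Allowed values, taken from the Location field description.
ALLOWED_LOCATIONS = {"Russia", "United States"}


def find_invalid_locations(results):
    """Return (document, value) pairs whose Location is not an allowed value.

    `results` is assumed to be a list of dicts with hypothetical
    "document" and "Location" keys; adapt the keys to your real export.
    """
    invalid = []
    for record in results:
        value = record.get("Location")  # None if the field is missing
        if value not in ALLOWED_LOCATIONS:
            invalid.append((record.get("document"), value))
    return invalid


# Hypothetical exported results, for illustration only.
sample_results = [
    {"document": "novel_a.pdf", "Location": "Russia"},
    {"document": "novel_b.pdf", "Location": "United States"},
    {"document": "novel_c.pdf", "Location": "russia"},  # wrong capitalization
    {"document": "novel_d.pdf", "Location": "USA"},     # wrong format
]

for name, value in find_invalid_locations(sample_results):
    print(f"{name}: unexpected Location value {value!r}")
```

The comparison is deliberately strict. A value like "russia" or "USA" is flagged, because a downstream step that expects exactly "Russia" or "United States" would likely mishandle it. Any flagged value is a sign that the field description needs another round of editing.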

Use cases and limitations

If the only purpose of your app is to sort documents into classes, simulated classification works fine. You’re likely to integrate the app with a downstream document processing workflow, so the app’s output becomes the input to the next step in the flow. That downstream step can use the simulated classification by branching on the value of each document’s Location field and processing the document accordingly.
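As a rough illustration of that downstream branching, here is a minimal sketch. Everything in it is an assumption made for the example: the dictionary shape of a document, the handler functions, and the manual review fallback are hypothetical and don’t reflect any particular workflow tool.

```python
def process_russia(doc):
    # Placeholder for whatever processing novels set in Russia need.
    return f"{doc['document']} -> Russia pipeline"


def process_united_states(doc):
    # Placeholder for whatever processing novels set in the US need.
    return f"{doc['document']} -> United States pipeline"


# Map each simulated class (a Location value) to its handler.
HANDLERS = {
    "Russia": process_russia,
    "United States": process_united_states,
}


def route(doc):
    """Dispatch a document based on its Location field value.

    `doc` is assumed to be a dict with hypothetical "document" and
    "Location" keys. Unexpected or missing values fall through to
    manual review instead of raising an error.
    """
    handler = HANDLERS.get(doc.get("Location"))
    if handler is None:
        return f"{doc['document']} -> manual review"
    return handler(doc)


print(route({"document": "novel_a.pdf", "Location": "Russia"}))
print(route({"document": "novel_b.pdf", "Location": "Unknown"}))
```

Using a lookup table keeps the set of simulated classes in one place. The fallback branch matters because, unlike a true class, a field value is free text and can occasionally come back in an unexpected form.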

Although simulated classification can, under some circumstances, play the same role as traditional classification, it’s not an exact equivalent. With traditional classification, you can add different fields to each class. For example, a Passport class could have a Country field, whereas a US Driver’s License class might have a State field. Because simulated classification puts all documents in the same Default class, apps that use simulated classification extract the same fields for all documents.

Simulated classification isn’t appropriate for every app, but it’s a useful technique to keep in mind when traditional classification struggles, such as with long or unstructured documents.
