Cleanup

You now know how to use the AI Hub SDK to automate document processing workflows. This use case generates digital entities in AI Hub: input documents, batches, result files, and others. After your app run completes, all those entities hang around the AI Hub filesystem.

This use case shows you how to extend the programs you’ve already written so they clean up after themselves by deleting the files and data they generate. This strategy is helpful for complying with data retention policies and for general digital hygiene.

Add two cleanup steps to the program you wrote for the automate use case.

  • Delete the batch of input documents.

  • Delete the app run.

Cleaning up after the automate use case

To declutter the AI Hub filesystem after running an app, either delete individual files or delete the entire app run. In this section of the tutorial, start with the first approach by deleting just the batch of input documents. Then use the second approach by deleting whatever else was produced by the app run.

Deleting the batch

Think back to the program you wrote to run the Meal Receipt app. That program uploaded a batch of files with documents for the app to process. Unless you want to process that same batch with a different app later, delete the batch (including all the files within it) after Meal Receipt sends back results.

Add this code to the bottom of your existing automate_with_sdk.py file.

print(f"deleting batch with ID {batch_id}")

# submit an asynchronous request to delete the whole batch
delete_batch_resp = client.batches.delete(batch_id)

# repeatedly check status of delete batch job
while True:
    # pause to let the delete batch job make progress
    time.sleep(3)
    # check status of the delete batch job
    status_resp = client.jobs.status(delete_batch_resp.job_id)
    print(f"delete batch status: {status_resp.state}")
    # break out of the loop if the delete batch job is no longer running
    if status_resp.state not in ["RUNNING", "PENDING"]:
        break

Deleting a batch is an asynchronous operation: call client.batches.delete() to ask for a batch to be deleted, then check the status of the delete job by calling client.jobs.status().
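
The polling loop above keeps checking until the job leaves the RUNNING or PENDING state, however long that takes. As a minimal sketch, the same pattern can be wrapped in a reusable function with a timeout. The name wait_for_job and its poll_seconds and timeout_seconds parameters are hypothetical and not part of the SDK; the only SDK call the sketch relies on is client.jobs.status(), used the same way as in the loop above.

```python
import time


def wait_for_job(client, job_id, poll_seconds=3, timeout_seconds=300):
    """Poll a job until it is no longer running, then return its final state.

    Hypothetical helper, not part of the AI Hub SDK. It assumes only what the
    tutorial shows: client.jobs.status(job_id) returns an object with a
    `state` attribute, and "RUNNING" and "PENDING" mean the job is unfinished.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = client.jobs.status(job_id).state
        if state not in ("RUNNING", "PENDING"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {state} after {timeout_seconds}s")
        # pause to let the job make progress before checking again
        time.sleep(poll_seconds)
```

With a helper like this, waiting for the batch deletion could become a single call: wait_for_job(client, delete_batch_resp.job_id). Unlike the loop above, this sketch checks the status once before its first pause.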

You might be surprised that the ID for the delete batch job is stored in a field called job_id, while other IDs you’ve dealt with are stored in fields called id. This brings up two important points.

  • Like most SDKs, the AI Hub SDK has some naming inconsistencies.
  • To learn what fields are available on an SDK method’s response, refer to the documentation for the appropriate method.
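
While exploring, it can also help to look at a response object directly. The sketch below uses only Python built-ins, so it makes no assumptions about how the SDK defines its response classes. The function name public_fields is hypothetical. Its output may include attributes that are not documented fields, so treat it as a quick aid and not as a substitute for the method's documentation.

```python
def public_fields(response):
    """Return the sorted names of an object's public, non-callable attributes.

    Hypothetical exploration helper built on dir() and getattr(); it is not
    part of the AI Hub SDK and knows nothing about the SDK's response types.
    """
    return sorted(
        name
        for name in dir(response)
        if not name.startswith("_")
        and not callable(getattr(response, name))
    )
```

For example, print(public_fields(delete_batch_resp)) lists the attribute names found on the response from client.batches.delete().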

Run the modified automate_with_sdk.py. The new output shows the batch being deleted after Meal Receipt finishes processing its documents and the program prints the results.

Deleting the app run

After retrieving the Meal Receipt results, clean up the app run data with client.apps.runs.delete(). This removes the output, logs, and database records.

Add this code below the delete batch code.

print(f"deleting app run with ID {run_resp.id}")
delete_run_resp = client.apps.runs.delete(run_resp.id)

# repeatedly check status of the job that deletes the output directory
while True:
    time.sleep(3)
    status_resp = client.jobs.status(delete_run_resp.delete_output_dir_job_id)
    print(f"delete output dir status: {status_resp.state}")
    if status_resp.state not in ["RUNNING", "PENDING"]:
        break

This operation is similar to the batch deletion you used, except it returns data with a delete_output_dir_job_id field. The ID in this field lets you check the progress of the delete job.

You know that the client.apps.runs.delete() method deletes an app run’s output directory, but it deletes three other kinds of entities as well. Look at the method’s documentation and see if you can figure out what else it deletes. Hint: remember the code you added to delete the batch? Well, it wasn’t strictly necessary. (But it was a good learning exercise!)

It turns out the method also deletes the batch (which the documentation refers to as the app run’s input directory), any log files, and associated DB data.

To keep the sample code short, it checks the status of only the job that deletes the output directory. To be more thorough, add separate loops that check the status of the jobs that delete the batch and the logs as well.
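
A more thorough version might wait on every delete job reported by the response. The sketch below is written under stated assumptions: only delete_output_dir_job_id appears in this tutorial, and the other two field names are guesses modeled on it, so confirm them in the documentation for client.apps.runs.delete() before using them. The function name wait_for_delete_jobs is hypothetical. Fields that are missing or empty on the response are skipped.

```python
import time

# Only delete_output_dir_job_id appears in this tutorial. The other two names
# are ASSUMPTIONS modeled on it; verify them in the method's documentation.
DELETE_JOB_FIELDS = (
    "delete_input_dir_job_id",   # assumed name: job that deletes the batch
    "delete_output_dir_job_id",  # shown in this tutorial
    "delete_log_dir_job_id",     # assumed name: job that deletes the logs
)


def wait_for_delete_jobs(client, delete_run_resp, poll_seconds=3):
    """Wait for each delete job on the response; return final states by field.

    Hypothetical helper, not part of the AI Hub SDK.
    """
    final_states = {}
    for field in DELETE_JOB_FIELDS:
        job_id = getattr(delete_run_resp, field, None)
        if job_id is None:
            # field missing or empty on this response: nothing to poll
            continue
        while True:
            state = client.jobs.status(job_id).state
            if state not in ("RUNNING", "PENDING"):
                break
            # pause to let the delete job make progress
            time.sleep(poll_seconds)
        final_states[field] = state
    return final_states
```

Calling wait_for_delete_jobs(client, delete_run_resp) returns a dictionary that maps each field name it found to the final state of that job.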

Unlike the other types of entities, DB data is deleted synchronously, so there’s no need to check its deletion status. That’s why there’s no delete_db_data_job_id field on the response from client.apps.runs.delete().

Running an app through a deployment instead of running it directly lets you configure the deployment’s data retention settings to handle cleanup automatically.

Confirming the automate cleanup

If you’ve been adding code to automate_with_sdk.py as it’s presented, your program deletes the batch and then deletes the entire app run. Confirm that the second step works by comparing the number of app runs that exist before and after the cleanup, using an SDK method called client.apps.runs.list().

Paste the snippet below into automate_with_sdk.py in two separate places.

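  • Immediately before you delete the app run, above the line that prints "deleting app run with ID" (line 82 in the complete program below).

  • Immediately after you delete the app run, below the loop that checks the delete output dir status (line 98 in the complete program below).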

# print number of app runs
list_runs_resp = client.apps.runs.list()
num_app_runs = len(list_runs_resp.runs)
print(f"number of app runs: {num_app_runs}")

Run automate_with_sdk.py. If the output shows that cleanup reduces the app run count by one, your cleanup logic works!
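
Instead of comparing the two printed numbers by eye, the program could make the comparison itself. This is a minimal sketch: the function names count_app_runs and cleanup_removed_one_run are hypothetical, and it assumes that client.apps.runs.list() returns every app run in a single response and that nothing else creates or deletes app runs while the program is running.

```python
def count_app_runs(client):
    """Return how many app runs client.apps.runs.list() reports.

    Hypothetical helper; it uses the same SDK call and `runs` attribute as
    the snippet in this tutorial.
    """
    return len(client.apps.runs.list().runs)


def cleanup_removed_one_run(count_before, count_after):
    """Return True if exactly one app run disappeared during cleanup."""
    return count_before - count_after == 1
```

To use it, store count_app_runs(client) in a variable before deleting the app run, call it again after the delete job finishes, and pass both values to cleanup_removed_one_run().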

Here’s the complete automate_with_sdk.py program, with all cleanup steps added.

Complete automate_with_sdk.py with cleanup
# prepare to use standard Python libraries
import sys
import time

# prepare to use the SDK
# and an exception that the SDK throws when authorization fails
from aihub import AIHub
from aihub.exceptions import UnauthorizedException

# authorize the SDK
client = AIHub(api_root="PASTE YOUR API ROOT HERE",
               api_key="PASTE YOUR API KEY HERE",
               ib_context="PASTE YOUR IB-CONTEXT HERE")

print("creating an empty batch")
try:
    # make an empty batch with a specific name in a specific workspace
    create_batch_resp = client.batches.create(
        name="receipt batch",
        workspace="SDK-Tutorial")
except UnauthorizedException:
    # exit the program while printing a user-friendly error message and
    # instructions on how to fix the problem
    sys.exit("ERROR: SDK not authorized. "
             "Are the API ROOT, API KEY, and IB-Context values correct?")

# store batch_id in an easy-to-read variable, since we'll use it several times
batch_id = create_batch_resp.id

print("uploading two files to the batch")

# upload the first file to the batch
client.batches.add_file(batch_id=batch_id,
                        file_name="receipt-a.jpg",
                        file=open("PATH/ON/YOUR/COMPUTER/TO/receipt-1.jpg", "rb"))

# upload a second file to the batch
client.batches.add_file(batch_id=batch_id,
                        file_name="receipt-b.jpg",
                        file=open("PATH/ON/YOUR/COMPUTER/TO/receipt-2.jpg", "rb"))

print("running the app")
# trigger an app run, specifying which app, who wrote it, and which batch it should process
run_resp = client.apps.runs.create(app_name="Meal Receipt",
                                   owner="Instabase",
                                   batch_id=batch_id)

print("checking the app status until it finishes")
while True:  # loop until explicitly told to leave the loop
    time.sleep(3)  # pause a few seconds between each app status check
    status_resp = client.apps.runs.status(run_resp.id)  # get the app status
    print(f"status: {status_resp.status}")  # update the user on the app status
    if status_resp.status not in ["PENDING", "RUNNING"]:  # these statuses mean the app is still running
        break  # the app is done, so stop looping

print("fetching the app results")
results_resp = client.apps.runs.results(run_resp.id)  # get the app results

for file in results_resp.files:  # iterate across all processed files
    print(f"file name: {file.original_file_name}")
    for document in file.documents:  # iterate across all documents in a file
        for field in document.fields:  # iterate across all fields in a document
            print(f"{field.field_name}: {field.value}")  # print the field name and value
    print("---")  # visual separator between files

print(f"deleting batch with ID {batch_id}")

# submit an asynchronous request to delete the whole batch
delete_batch_resp = client.batches.delete(batch_id)

# repeatedly check status of delete batch job
while True:
    # pause to let the delete batch job make progress
    time.sleep(3)
    # check status of the delete batch job
    status_resp = client.jobs.status(delete_batch_resp.job_id)
    print(f"delete batch status: {status_resp.state}")
    # break out of the loop if the delete batch job is no longer running
    if status_resp.state not in ["RUNNING", "PENDING"]:
        break

# print number of app runs
list_runs_resp = client.apps.runs.list()
num_app_runs = len(list_runs_resp.runs)
print(f"number of app runs: {num_app_runs}")

print(f"deleting app run with ID {run_resp.id}")
delete_run_resp = client.apps.runs.delete(run_resp.id)

# repeatedly check status of the job that deletes the output directory
while True:
    time.sleep(3)
    status_resp = client.jobs.status(delete_run_resp.delete_output_dir_job_id)
    print(f"delete output dir status: {status_resp.state}")
    if status_resp.state not in ["RUNNING", "PENDING"]:
        break

# print number of app runs
list_runs_resp = client.apps.runs.list()
num_app_runs = len(list_runs_resp.runs)
print(f"number of app runs: {num_app_runs}")

Cleanup conclusion

You’ve covered all the cleanup that’s necessary for this use case. You might be surprised at how much longer the automate_with_sdk.py program is after cleanup logic is added. As with exception handling, cleaning up after yourself is an important (if tedious) task for responsible programmers. Remember to leave plenty of time to add similar logic to your own SDK-enabled programs.

By adding new features to programs written earlier, you’ve experienced the common task of returning to code that you thought was complete but that now needs to be maintained. This task is easier when you’ve added thorough comments—such as you see in the complete examples here—to provide guideposts. It’s amazing how quickly uncommented code turns cryptic when you step away from it for a while, even when you were the original author.

The last page of this tutorial has a recap of what you’ve learned and guidance on where to go next on your AI Hub SDK journey.