File limitations and processing

AI Hub enforces these maximum resource limits for files. A sample pre-upload check is sketched after this list.

  • Single file: 50 MB or 800 pages
  • Total single-upload size: 100 MB

  • Conversations: About 500 files or 4,000,000 tokens

    In conversations, the total upload limit is based on a token limit of 4 million. You might reach the token limit with fewer than 500 files; the 500-file count approximates the token limit and is enforced in the user interface only. When creating a conversation by API, you can add any number of files, up to the token limit.
  • Automation projects: 500 files

  • Automation app runs: 1,000 files
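The following sketch shows one way to check the single-file and total-upload limits on the client side before uploading. The helper name, the use of pypdf for page counts, and the messages are illustrative only; they are not part of AI Hub.

    # Rough client-side check against the documented limits (50 MB or
    # 800 pages per file, 100 MB per upload). Assumes pypdf is installed.
    from pathlib import Path
    from pypdf import PdfReader

    MAX_FILE_BYTES = 50 * 1024 * 1024
    MAX_FILE_PAGES = 800
    MAX_UPLOAD_BYTES = 100 * 1024 * 1024

    def check_upload(paths: list[str]) -> list[str]:
        """Return human-readable problems; an empty list means the upload fits."""
        problems = []
        total = 0
        for p in paths:
            size = Path(p).stat().st_size
            total += size
            if size > MAX_FILE_BYTES:
                problems.append(f"{p}: larger than 50 MB")
            if p.lower().endswith(".pdf") and len(PdfReader(p).pages) > MAX_FILE_PAGES:
                problems.append(f"{p}: more than 800 pages")
        if total > MAX_UPLOAD_BYTES:
            problems.append("total upload is larger than 100 MB")
        return problems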

Supported file types

These file types are supported for import.

.bat, .bashc, .c, .cc, .chtml, .cmake, .cmd, .cpp, .cs, .css, .csv, .cxx, .cy, .dockerfile, .doc, .docx, .eml, .gdoc, .go, .gsheet, .gslides, .h++, .hpp, .html, .java, .jpeg, .jpg, .js, .json, .mht, .mhtml, .mkfile, .msg, .pdf, .perl, .php, .plsql, .png, .pptx, .py, .pxi, .pyx, .r, .rd, .rs, .rtf, .ruby, .tif, .tiff, .ts, .txt, .xls, .xlsx, .xml, .yaml, .yml, .zsh

When browsing connected Google Drives, native Google file types (.gdoc, .gsheet, .gslides) are displayed in the file explorer but the files are converted to PDF when imported.

In commercial and enterprise automation projects with file splitting enabled, multipage files can include multiple documents. For best results in all other projects and conversations, use one file for each document.

App run results can be exported in CSV or Excel format.

For images, higher resolution provides better digitization results. Aim for a minimum scanning or capture resolution of 300 DPI.
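If your scans carry resolution metadata, you can confirm they meet the 300 DPI guideline before uploading. This sketch assumes the Pillow library and images that record DPI; images without that metadata need to be checked manually.

    # Check recorded scan resolution against the 300 DPI guideline.
    from PIL import Image

    def meets_min_dpi(path: str, minimum: int = 300) -> bool:
        with Image.open(path) as img:
            dpi = img.info.get("dpi")   # (x_dpi, y_dpi) when present
            return dpi is not None and min(dpi) >= minimum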

Digitization details

When you upload files to AI Hub, the default digitization process includes these steps.

Modifying digitization settings for your conversation or project can impact these steps.

  • Email attachments are separated and treated as individual files. Inline images are treated as part of the email body.

  • Google Drive files are converted to PDF.

  • PDF layers are flattened to include all text and image elements.

  • Optical character recognition (OCR) is performed on both typed and handwritten text.

  • Page rotation, skew, and warp are corrected.

  • Signatures, checkboxes, and barcodes (both numeric and non-numeric formats) are detected, and appropriate markers are added to the text space.

Spreadsheet limitations

Excel spreadsheets and CSV files are subject to these limitations.

Excel files are processed in their native format, unless you disable Process spreadsheets natively in digitization settings for your conversation or project. Native processing offers better results for wide tables, but doesn’t support embedded objects, such as charts, or source highlighting in results.

Upload limitations

  • Files must be less than 10 MB.

  • Files can contain one large table with up to 400 columns. Excel files can contain multiple small- to medium-sized tables on one sheet (totaling 200 rows and 30 columns). A sample pre-upload check for CSV files follows this list.
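As an illustration of the size and column limits above, the following sketch checks a CSV file before upload. The 10 MB and 400-column values come from this page; the helper itself is not part of AI Hub.

    # Rough pre-upload check of a CSV against the spreadsheet upload limits.
    import csv
    import os

    def check_csv(path: str) -> list[str]:
        problems = []
        if os.path.getsize(path) >= 10 * 1024 * 1024:
            problems.append("file is not less than 10 MB")
        with open(path, newline="") as f:
            header = next(csv.reader(f), [])
            if len(header) > 400:
                problems.append(f"{len(header)} columns exceeds the 400-column limit")
        return problems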

Extraction limitations

Total extracted results are limited to 80,000 cells (columns × rows). For example:

  • If extracting 400 columns, you can retrieve up to 200 rows (400 × 200 = 80,000).

  • If extracting 10 columns, you can retrieve up to 8,000 rows (10 × 8,000 = 80,000).

You can adjust the number of columns and rows as needed within the 80,000 cell limit.
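The trade-off is simple arithmetic: dividing the 80,000-cell budget by the number of extracted columns gives the maximum number of rows, as in this sketch.

    # Maximum rows retrievable for a given column count under the
    # 80,000-cell extraction limit.
    CELL_LIMIT = 80_000

    def max_rows(columns: int) -> int:
        return CELL_LIMIT // columns

    print(max_rows(400))  # 200
    print(max_rows(10))   # 8000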

Unsupported features

  • Advanced features such as macros and data validation.

  • Triangular or nested tables.

  • Tables with multi-row or frozen headers.

  • Tables with empty rows or columns.

Token limits

When uploading files, some areas of AI Hub enforce a file-based upload limit while others enforce a total token upload limit. Tokens are the fundamental unit by which LLMs process text. Each token represents a piece of text, such as a whole word, part of a word, or a character. This means the density of information in an uploaded document affects the number of tokens required to encode that information. For example, 500 sparsely populated documents might be encoded in fewer tokens than 50 densely populated documents.

Factors that affect the number of tokens required to encode a document include language and complexity of content. As a guideline, for English text, one token encodes approximately four characters.
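Based on that four-characters-per-token guideline, you can make a rough estimate of how much English text fits in the 4,000,000-token conversation limit. The estimate below is only a heuristic; actual token counts depend on the tokenizer the model uses.

    # Rough token estimate for English text (about 4 characters per token).
    def estimate_tokens(text: str) -> int:
        return max(1, round(len(text) / 4))

    # At roughly 4 characters per token, the 4,000,000-token conversation
    # limit corresponds to on the order of 16 million characters of English text.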