Supported languages

Document digitization in AI Hub uses a variety of third-party OCR processors to achieve the best results. Digitization settings are interdependent, so your digitization requirements (for example, print versus handwritten text) affect which languages are supported.

Default language set

By default, digitization supports only languages that use Latin characters (a, b, c…) and that appear in the language list for the Azure AI Vision Read container or the Microsoft Read model, depending on your digitization settings and documents.

Common languages within these lists that use non-Latin characters—and thus aren’t supported by default—include Arabic, Bengali, Chinese Simplified, Chinese Traditional, Greek, Hebrew, Hindi, Japanese, Korean, Russian, Thai, and Urdu.

Standard non-Latin language set

Commercial & Enterprise

The standard non-Latin language set supports all languages in the Azure AI Vision Read container or Microsoft Read model, depending on your settings and documents.

Advanced non-Latin language set

Enterprise

The advanced non-Latin language set fully supports all languages in the Google Cloud Vision API.

Compared to the standard non-Latin language set, the advanced set adds support for Armenian, Bengali, Greek, Gujarati, Hebrew, Kannada, Khmer, Lao, Latvian, Macedonian, Malayalam, Tagalog, Tamil, Telugu, Thai, Ukrainian, Vietnamese, and Yiddish.
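
If you track language coverage in your own tooling, the plan labels and the list above can be captured in a small lookup. The following is a minimal sketch only: the names `ADVANCED_SET_ADDITIONS`, `LANGUAGE_SET_PLANS`, `is_added_by_advanced_set`, and `non_latin_sets_for_plan` are hypothetical and aren't part of AI Hub, and actual support still depends on your digitization settings and documents.

```python
# Hypothetical lookup tables built from this page; not an AI Hub API.

# Languages the advanced non-Latin set adds over the standard set.
ADVANCED_SET_ADDITIONS = frozenset(
    name.casefold()
    for name in (
        "Armenian", "Bengali", "Greek", "Gujarati", "Hebrew", "Kannada",
        "Khmer", "Lao", "Latvian", "Macedonian", "Malayalam", "Tagalog",
        "Tamil", "Telugu", "Thai", "Ukrainian", "Vietnamese", "Yiddish",
    )
)

# Plan tiers that each non-Latin language set is labeled with on this page.
LANGUAGE_SET_PLANS = {
    "standard non-Latin": {"Commercial", "Enterprise"},
    "advanced non-Latin": {"Enterprise"},
}


def is_added_by_advanced_set(language: str) -> bool:
    """Return True if the language is one the advanced set adds."""
    return language.strip().casefold() in ADVANCED_SET_ADDITIONS


def non_latin_sets_for_plan(plan: str) -> list[str]:
    """Return the non-Latin language sets labeled with the given plan tier."""
    return [name for name, plans in LANGUAGE_SET_PLANS.items() if plan in plans]


if __name__ == "__main__":
    print(is_added_by_advanced_set("Khmer"))
    print(non_latin_sets_for_plan("Commercial"))
```

The lookup is case-insensitive, so `is_added_by_advanced_set("khmer")` and `is_added_by_advanced_set("Khmer")` behave the same. It only reports what this page lists and doesn't query any OCR processor.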