ABBYY FineReader Engine

The most comprehensive OCR SDK for software developers

Integrate AI-powered OCR features into your applications.

Document classification using Machine Learning and NLP

ABBYY FineReader Engine provides an API for document classification that lets you build applications that automatically categorize documents and sort them into predefined document classes. The classification relies on machine learning and natural language processing, which can detect even subtle differences between document categories. This makes it possible to set up flexible, scalable classification processes that distinguish among many document categories at a fine level of detail.

The Image Classifier collects and processes visual information about document images and delivers fast classification results. The Text Classifier extracts and processes information about the documents’ content, which increases classification accuracy. The Image Classifier and the Text Classifier can be used individually or in combination.

How does it work?

In principle, the classification process consists of three steps:

  • 1

    Preparing data sets for classification training

    In this step, the required document classes are defined. For each document class, several example documents with similar appearance and/or content are selected. Using machine learning and natural language processing algorithms, the ABBYY technology analyzes the training documents within each document class and determines the parameters that should be used to identify that class.

  • 2

    Training the Classification Model

    During this step, information about the document classes and their parameters is imported into the Classification Model, and the model is trained. The model can use the Image Classifier, the Text Classifier, or a combination of both. Performance can be tuned by setting the balance between high recall and high precision. Cross-validation is available to test the quality of the Classification Model.

  • 3

    Classification deployment

    During the classification process, the Classification Model analyzes each incoming document. To determine the document type, the model calculates the required parameters for each document and compares them with the information it received during the training step. Developers can create a routine that lets users update the training data set and re-train the Classification Model as needed.
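
The three steps above can be sketched in a few lines of Python. This is a toy word-count classifier standing in for the engine, not the ABBYY FineReader Engine API: the function names `train` and `classify`, the sample texts, and the confidence value (a normalized share of matches, not a calibrated probability) are all assumptions made for illustration.

```python
from collections import Counter


def train(training_set):
    """Step 2: build one word-frequency profile per document class.

    training_set maps a class name to a list of example texts (step 1).
    """
    model = {}
    for label, texts in training_set.items():
        counts = Counter()
        for text in texts:
            counts.update(text.lower().split())
        model[label] = counts
    return model


def classify(model, text):
    """Step 3: compare a document with each class profile.

    Returns the best-matching class and its share of the total score.
    """
    words = text.lower().split()
    # +1 keeps every class at a non-zero score (simple smoothing).
    scores = {label: 1 + sum(counts[w] for w in words)
              for label, counts in model.items()}
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


# Step 1: a few hypothetical examples per document class.
training_set = {
    "invoice": ["invoice number total amount due",
                "invoice payment due date amount"],
    "contract": ["agreement between parties terms",
                 "contract terms signed by parties"],
}

model = train(training_set)
label, confidence = classify(model, "amount due on this invoice")
print(label, confidence)

# Updating the training data and re-training is a second call to train().
training_set["receipt"] = ["receipt cash paid thank you"]
model = train(training_set)
```

Re-training here simply rebuilds the model from the updated data set, which mirrors the update routine described in step 3.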

In addition to the detected document categories, the classification result includes the probability that documents belong to them. This probability can be used to determine the next processing steps, such as forwarding documents to the relevant company departments or re-classifying them.
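
As a rough illustration of probability-based routing, the sketch below maps a classification result to a next step. The `ClassificationResult` type, the department names, and the two thresholds are hypothetical; they are not taken from the ABBYY FineReader Engine API and would need to be chosen for a real workflow.

```python
from dataclasses import dataclass


@dataclass
class ClassificationResult:
    category: str       # detected document class
    probability: float  # probability that the document belongs to it (0.0 to 1.0)


# Hypothetical mapping from document class to receiving department.
DEPARTMENTS = {"invoice": "accounting", "contract": "legal", "resume": "hr"}


def next_step(result, accept_at=0.80, review_at=0.50):
    """Decide what happens to a document after classification."""
    # Confident result for a known class: forward to the department.
    if result.probability >= accept_at and result.category in DEPARTMENTS:
        return "forward:" + DEPARTMENTS[result.category]
    # Medium confidence, or a confident result for an unmapped class.
    if result.probability >= review_at:
        return "manual-review"
    # Low confidence: send the document back for re-classification.
    return "reclassify"


print(next_step(ClassificationResult("invoice", 0.93)))
print(next_step(ClassificationResult("contract", 0.62)))
print(next_step(ClassificationResult("resume", 0.20)))
```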

ABBYY FineReader Engine’s documentation illustrates the classification process with a code sample that can be used for testing, adjusted, and integrated into your own applications.

Classification modes

Depending on the usage scenario, classification can be optimized for high precision, high recall, or a balance between the two.

  • High precision mode

    This mode is recommended in scenarios where it is important to classify documents into the right categories and keep wrong class assignments to a minimum.

    Documents identified as belonging to class A should really belong to class A and not to class B. In return, it is acceptable that ‘uncertain’ documents belonging to class A are not classified as such and are left out.

    Key focus: Precisely categorize documents and limit the risk of assigning documents to wrong document classes.

  • High recall mode

    This mode is recommended in scenarios where it is important to detect all documents belonging to a certain category among all available documents and to limit the risk that any are missed.

    Documents belonging to class A should not go undetected in the document batch. In return, it is acceptable that some documents classified as class A actually belong to class B.

    Key focus: Within a document batch, detect all documents belonging to a certain class and limit the risk of leaving them out.
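
The trade-off between the two modes can be made concrete with a small calculation. The sketch below assumes that a mode corresponds to a probability cutoff for accepting a document as class A; this cutoff, the function name `precision_recall`, and the sample scores are illustrative assumptions, not a description of how the engine implements its modes.

```python
def precision_recall(samples, threshold):
    """samples: list of (probability of class A, document really is class A)."""
    predicted = [is_a for score, is_a in samples if score >= threshold]
    true_positives = sum(predicted)
    actual_positives = sum(is_a for _, is_a in samples)
    # Precision: share of documents assigned to class A that really are class A.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of real class A documents that were detected.
    recall = true_positives / actual_positives if actual_positives else 0.0
    return precision, recall


# Hypothetical batch: five documents really belong to class A, three do not.
SAMPLES = [
    (0.95, True), (0.88, True), (0.75, False), (0.70, True),
    (0.55, True), (0.40, False), (0.35, True), (0.10, False),
]

# A strict cutoff behaves like a high precision mode.
print(precision_recall(SAMPLES, 0.8))  # precision 1.0, recall 0.4
# A lenient cutoff behaves like a high recall mode.
print(precision_recall(SAMPLES, 0.3))  # precision 5/7, recall 1.0
```

With the strict cutoff, every document assigned to class A is correct, but three real class A documents are left out. With the lenient cutoff, no class A document is missed, but two documents from other classes are assigned to class A.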

Start benefiting from ABBYY FineReader Engine today

contact us