ABBYY FineReader Engine
The most comprehensive OCR SDK for software developers
Integrate AI-powered OCR features into your applications
Document classification using Machine Learning
ABBYY FineReader Engine provides an API for document classification that lets you build applications which automatically categorize documents and sort them into predefined document classes. The advanced document classification is based on machine learning, which can detect even subtle differences between document categories and supports flexible, scalable classification processes that distinguish among many categories at a fine level of detail. The Image Classifier collects and processes visual information about document images and delivers fast classification results. The Text Classifier extracts and processes information about the documents’ content, which increases classification accuracy. The Image Classifier and the Text Classifier can be used individually or in combination.
How does it work?
In principle, the classification process consists of three steps:
Preparing data sets for classification training
At this step, the required document classes are defined. For each document class, several example documents with similar appearance and/or content are selected. Using machine learning algorithms, the ABBYY technology analyzes the training documents within each document class and determines the parameters that should be used to identify that class.
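The preparation step can be pictured with a small sketch in plain Python. It does not use the FineReader Engine API; the function name `build_training_set`, the file names, and the minimum of two examples per class are assumptions made only for illustration. It groups labeled example documents by class and reports classes that have too few examples to train on.

```python
from collections import defaultdict

# Hypothetical minimum; the real requirement depends on the product and data.
MIN_EXAMPLES_PER_CLASS = 2


def build_training_set(labeled_files):
    """Group (file_path, class_label) pairs by class.

    Returns the grouped examples and a sorted list of classes that
    have fewer than MIN_EXAMPLES_PER_CLASS examples.
    """
    by_class = defaultdict(list)
    for path, label in labeled_files:
        by_class[label].append(path)
    too_small = sorted(
        label for label, files in by_class.items()
        if len(files) < MIN_EXAMPLES_PER_CLASS
    )
    return dict(by_class), too_small


# Illustrative file names only.
labeled_files = [
    ("scan_001.tif", "invoice"),
    ("scan_002.tif", "invoice"),
    ("scan_003.tif", "contract"),
]
training_set, too_small = build_training_set(labeled_files)
```

Here `too_small` flags the `contract` class, which would need more examples before training.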
Training the Classification Model
During this step, information about the document classes and their parameters is imported into the Classification Model, and the model is trained. The model can use the Image Classifier, the Text Classifier, or a combination of both. Performance can be tuned by setting the balance between high recall and high precision. Cross-validation is available to test the quality of the Classification Model.
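To illustrate what cross-validation measures, here is a minimal sketch in plain Python. It is not the FineReader Engine implementation: the word-counting `train` and `predict` functions are toy stand-ins for a real classifier, and all names and sample texts are assumptions. The data is split into k folds, the model is trained on k-1 folds and tested on the held-out fold, and the results are averaged over all folds.

```python
from collections import Counter, defaultdict


def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, test_idx


def train(samples):
    """Toy stand-in for training: count words seen in each class."""
    word_counts = defaultdict(Counter)
    for text, label in samples:
        word_counts[label].update(text.lower().split())
    return word_counts


def predict(model, text):
    """Pick the class whose training vocabulary overlaps the text most."""
    words = text.lower().split()
    scores = {
        label: sum(counts[w] for w in words)
        for label, counts in model.items()
    }
    return max(scores, key=scores.get)


def cross_validate(samples, k=3):
    """Return the fraction of held-out samples classified correctly."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = train([samples[i] for i in train_idx])
        correct += sum(
            predict(model, samples[i][0]) == samples[i][1] for i in test_idx
        )
    return correct / len(samples)


samples = [
    ("invoice total amount due", "invoice"),
    ("contract party agreement term", "contract"),
    ("invoice number amount tax", "invoice"),
    ("agreement between party and party", "contract"),
    ("amount due invoice date", "invoice"),
    ("term of contract agreement", "contract"),
]
accuracy = cross_validate(samples, k=3)
```

Because every document is tested by a model that never saw it during training, the resulting accuracy is a more honest quality estimate than testing on the training data itself.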
Classification
During the classification process, the Classification Model analyzes each incoming document. To determine the document type, the model calculates the required parameters for each document and compares them with the information it received during the training step. Developers can create a routine that lets users update the training data set and re-train the Classification Model as needed.
Depending on the usage scenario, classification can be optimized for high precision, for high recall, or for a balance between the two.
High precision mode
This mode is recommended in scenarios where it is important to assign documents to the right categories and keep wrong class assignments to a minimum. Documents identified as belonging to class A should really belong to class A and not to class B; it is acceptable that ‘uncertain’ documents belonging to class A are not classified as such and are left out. Key focus: Categorize documents precisely and limit the risk of assigning them to the wrong document classes.
High recall mode
This mode is recommended in scenarios where it is important to detect all documents belonging to a certain category among all available documents and to limit the risk that any of them are missed. Key focus: Within a document batch, detect all documents belonging to a certain class and limit the risk of leaving them out.
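The trade-off between the two modes can be shown with the standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below is plain Python with made-up data. It assumes that a mode can be modeled as a confidence threshold below which a document is left unclassified; this is an illustration, not a description of how FineReader Engine implements its modes.

```python
def precision_recall(predictions, target_class, threshold):
    """Compute precision and recall for one class.

    predictions: list of (true_label, predicted_label, confidence).
    A document is assigned to target_class only if it was predicted
    as that class with confidence >= threshold.
    """
    tp = fp = fn = 0
    for true_label, predicted_label, confidence in predictions:
        assigned = predicted_label == target_class and confidence >= threshold
        if assigned and true_label == target_class:
            tp += 1  # correctly assigned to the class
        elif assigned:
            fp += 1  # assigned to the class but belongs elsewhere
        elif true_label == target_class:
            fn += 1  # belongs to the class but was left out
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Made-up results: (true class, predicted class, confidence).
predictions = [
    ("A", "A", 0.95),
    ("A", "A", 0.80),
    ("B", "A", 0.70),
    ("A", "A", 0.60),
    ("B", "B", 0.90),
    ("A", "B", 0.55),
]
strict = precision_recall(predictions, "A", threshold=0.75)   # (1.0, 0.5)
lenient = precision_recall(predictions, "A", threshold=0.50)  # (0.75, 0.75)
```

With the strict threshold, every document assigned to class A really belongs to it, but half of the class A documents are left out. Lowering the threshold recovers more class A documents at the cost of one wrong assignment.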