Automated document analysis is a key part of the overall document recognition process. To perform this step with high precision, ABBYY FineReader Engine uses many advanced algorithms based on artificial intelligence methods and leverages the ABBYY Document Recognition Technology (ADRT).
During the document analysis step, the document is analyzed with regard to its logical structure: the first and last pages of the document are identified, and formatting elements such as footnotes, headers, footers and tables of contents are detected.
At the same time, the layout of each individual page is detected and each page is divided into individual objects, such as text blocks, pictures, tables and table cells, barcodes, and separators. Additionally, the document analysis algorithms detect page orientation, identify double pages, detect vertical text and define page areas that are not relevant for the OCR process.
As a result, ABBYY FineReader Engine is able to specify the text areas and fields that should be recognized and the page areas, such as images or diagrams, that should be kept in their original form. At the same time, it receives information about the logical document structure, including its formatting, which is used at the end of the OCR process, when the document is exactly reconstructed.
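The outcome described above can be pictured with a minimal sketch. It assumes a simplified, hypothetical data model: the `Block` type, the block kind names and the `split_blocks` function are illustrative only and are not part of the ABBYY FineReader Engine API.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical page object produced by a layout analysis step."""
    kind: str    # illustrative kinds: "text", "table", "picture"
    left: int
    top: int
    right: int
    bottom: int


def split_blocks(
    blocks: List[Block],
    recognized_kinds: FrozenSet[str] = frozenset({"text", "table"}),
) -> Tuple[List[Block], List[Block]]:
    """Separate blocks that go to OCR from blocks kept in their original form."""
    to_recognize = [b for b in blocks if b.kind in recognized_kinds]
    to_preserve = [b for b in blocks if b.kind not in recognized_kinds]
    return to_recognize, to_preserve


# Assumed example page with three detected objects.
page = [
    Block("text", 50, 40, 550, 120),
    Block("picture", 50, 140, 300, 400),
    Block("table", 320, 140, 550, 400),
]

to_recognize, to_preserve = split_blocks(page)
print([b.kind for b in to_recognize])  # ['text', 'table']
print([b.kind for b in to_preserve])   # ['picture']
```

In this sketch the set of recognized kinds is a parameter, so a mode that treats all printed information as relevant could be modelled by passing a wider set.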
It is also possible to select the recognition area manually by specifying the exact coordinates of the field and the type of data inside it. Manual specification of recognition areas is frequently used for precise extraction of data from predefined areas, the so-called Field-Level Recognition.
ABBYY FineReader Engine provides three automatic modes and one manual mode of document analysis:
General document analysis is the default analysis type. It searches for all objects: text blocks, pictures, tables, barcodes and separators. The results of this analysis are used to retrieve the document structure and layout when documents are processed for further reuse and need to be exactly reconstructed. All pictures and diagrams are preserved in their original form, without recognizing the text inside pictures or logos.
This document analysis type is intended for converting semi-structured documents such as invoices, bills, waybills, business cards, agreements and health claim forms. It accurately locates all text on a document, including characters and numbers, even if this information is located within stamps, pictures, logos or small-text areas.
Unlike general document analysis, this mode assumes that all printed information on a document is relevant. It ensures that important text is not mistakenly identified as a graphic element, for example when the text is part of a company logo or a stamp. As a result, the maximum amount of information about the text and its coordinates is available to other systems that perform steps such as analysis, field-by-field processing and parsing.
This type of document analysis automatically detects and recognizes all text in a document, including text embedded in pictures, charts, and diagrams. Developers can use this mode to extract exhaustive full-text information for index building in DMS, CMS, ERP and archiving systems.
In the manual mode, the text recognition areas are set up by hand. The relevant recognition fields are defined directly, so automated document analysis is not necessary. During the later recognition step, the recognizer receives the coordinates and properties of the requested fields and applies OCR only to the specified zones.
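The idea of passing only predefined zones to the recognizer can be sketched as follows. This is an illustrative model under stated assumptions, not the ABBYY FineReader Engine API: the `Zone` type, its `field_type` values, and the `clip_to_page` and `crop` helpers are hypothetical, and the page is represented as a plain grid of pixel values.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Zone:
    """Hypothetical manually defined recognition field."""
    name: str
    field_type: str   # illustrative values, e.g. "text" or "digits"
    left: int
    top: int
    right: int        # exclusive
    bottom: int       # exclusive


def clip_to_page(zone: Zone, page_width: int, page_height: int) -> Zone:
    """Clamp a zone to the page bounds; reject zones that fall outside the page."""
    left, top = max(0, zone.left), max(0, zone.top)
    right, bottom = min(page_width, zone.right), min(page_height, zone.bottom)
    if left >= right or top >= bottom:
        raise ValueError(f"zone {zone.name!r} does not overlap the page")
    return Zone(zone.name, zone.field_type, left, top, right, bottom)


def crop(page: List[List[int]], zone: Zone) -> List[List[int]]:
    """Return only the pixels inside the zone; the rest of the page is ignored."""
    return [row[zone.left:zone.right] for row in page[zone.top:zone.bottom]]


# Assumed 6x4 page where each pixel value encodes its row and column.
page = [[r * 10 + c for c in range(6)] for r in range(4)]

amount = clip_to_page(Zone("amount", "digits", 1, 1, 4, 3), 6, 4)
print(crop(page, amount))  # [[11, 12, 13], [21, 22, 23]]
```

Only the cropped region would then be handed to a recognizer together with the zone's field type, which is why no page-level analysis is needed in this mode.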