The automated document analysis step is a key part of the overall document recognition process. To perform this step with high precision, ABBYY FineReader Engine uses a range of advanced algorithms based on artificial intelligence methods.
During the document analysis step, the document is analyzed with regard to its logical structure: the first and last pages of the document are identified, and formatting elements such as footnotes, headers, footers, and the table of contents are detected.
At the same time, the layout of each individual page is detected and each page is divided into individual objects, such as text blocks, pictures, tables and table cells, barcodes, and separators. Additionally, the document analysis algorithms detect page orientation, identify double pages, detect vertical text, and define page areas that are not relevant for the OCR process.
As a result, ABBYY FineReader Engine can distinguish the text areas and fields that should be recognized from the page areas, such as images or diagrams, that should be kept in their original form. At the same time, it obtains information about the logical document structure (including its formatting), which is used at the end of the OCR process, when the document is reconstructed exactly.
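The split between areas to recognize and areas to preserve can be pictured with a small data model. The following is only an illustrative sketch in Python, not the ABBYY FineReader Engine API: the names `BlockType`, `Block`, `RECOGNIZED_TYPES`, and `split_blocks` are hypothetical, and the choice of which block types go to recognition is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable, List, Tuple


class BlockType(Enum):
    """Hypothetical subset of the page objects found by layout analysis."""
    TEXT = "text"
    TABLE = "table"
    PICTURE = "picture"


@dataclass(frozen=True)
class Block:
    """A rectangular page object; coordinates are in pixels (assumed)."""
    block_type: BlockType
    left: int
    top: int
    right: int
    bottom: int


# Assumption for this sketch: text and tables are recognized,
# everything else is kept in its original form.
RECOGNIZED_TYPES: FrozenSet[BlockType] = frozenset(
    {BlockType.TEXT, BlockType.TABLE}
)


def split_blocks(
    blocks: Iterable[Block],
    recognized_types: FrozenSet[BlockType] = RECOGNIZED_TYPES,
) -> Tuple[List[Block], List[Block]]:
    """Return (blocks to recognize, blocks to preserve), keeping page order."""
    to_recognize: List[Block] = []
    to_preserve: List[Block] = []
    for block in blocks:
        if block.block_type in recognized_types:
            to_recognize.append(block)
        else:
            to_preserve.append(block)
    return to_recognize, to_preserve
```

In a sketch like this, the first list would be passed on to the recognition step, while the second would be carried through unchanged for the final document reconstruction.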
The results of this analysis are used for document structure and layout retrieval if documents are processed for further reuse, which means that the documents need to be reconstructed exactly. All pictures and diagrams are preserved in their original form, without recognizing the text inside pictures or logos.
The text recognition areas can also be set up manually. In this case, the relevant recognition field is defined directly and automated document analysis is not necessary. During the later recognition step, the recognizer receives the coordinates and properties of the requested fields and applies OCR only to the specified zones.
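The idea of manually defined zones can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the ABBYY FineReader Engine API: `Zone`, `crop`, and `recognize_zones` are hypothetical names, the page is modeled as a plain grid of pixel values, and the `recognizer` argument stands in for whatever OCR routine is actually used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Pixels = List[List[int]]


@dataclass(frozen=True)
class Zone:
    """A manually defined recognition field (hypothetical model).

    Coordinates are pixel indices; right and bottom are exclusive.
    """
    name: str
    left: int
    top: int
    right: int
    bottom: int


def crop(page: Sequence[Sequence[int]], zone: Zone) -> Pixels:
    """Cut the zone out of the page, clipping it to the page bounds."""
    height = len(page)
    width = len(page[0]) if height else 0
    top = max(0, min(zone.top, height))
    bottom = max(top, min(zone.bottom, height))
    left = max(0, min(zone.left, width))
    right = max(left, min(zone.right, width))
    return [list(row[left:right]) for row in page[top:bottom]]


def recognize_zones(
    page: Sequence[Sequence[int]],
    zones: Sequence[Zone],
    recognizer: Callable[[Pixels], str],
) -> Dict[str, str]:
    """Apply the recognizer only to the given zones, skipping page analysis."""
    return {zone.name: recognizer(crop(page, zone)) for zone in zones}
```

Because only the listed zones are cropped and passed to the recognizer, the rest of the page is never analyzed, which mirrors the behavior described above for manually defined fields.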