Document Analysis

 

Basic document analysis features

Document Analysis is a set of functions for automatic detection of the following objects on a page:

  • Text blocks 
  • Pictures 
  • Tables and table cells 
  • Barcodes 
  • Separators 

Additionally, document analysis provides several features that prepare the image for OCR:

  • detection of page orientation: rotation by 90, 180, or 270 degrees (see Image Processing)
  • splitting of double pages
  • detection of vertical text in table cells
  • detection and marking of garbage blocks on the page
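
To illustrate what correcting a detected orientation involves, the sketch below maps a block rectangle measured on a scan onto the page after that scan is rotated clockwise by 90, 180, or 270 degrees. It is plain geometry under stated assumptions: rectangles are (left, top, right, bottom) in pixels with the origin at the top-left corner, and the function name is hypothetical, not part of the FineReader Engine API.

```python
from typing import Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom), origin at top-left


def rotate_rect_clockwise(rect: Rect, width: int, height: int, degrees: int) -> Rect:
    """Map a rectangle on a width x height page to its position after
    the whole page is rotated clockwise by the given angle."""
    left, top, right, bottom = rect
    if degrees == 0:
        return rect
    if degrees == 90:
        # The rotated page is height x width; the old left edge becomes the top edge.
        return (height - bottom, left, height - top, right)
    if degrees == 180:
        return (width - right, height - bottom, width - left, height - top)
    if degrees == 270:
        # The rotated page is height x width; the old top edge becomes the left edge.
        return (top, width - right, bottom, width - left)
    raise ValueError("degrees must be 0, 90, 180, or 270")
```

Rotating by 90 degrees and then by 270 degrees (with width and height swapped for the second call) returns the original rectangle, which is a convenient sanity check.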

This preparation is important because it determines which areas of the page should be recognized and which should be kept in their original form.
You can also designate a field for recognition manually. In this case you set the field's coordinates and the type of data it contains. This approach is used in the field-level recognition scenario, mostly for data capture.
ABBYY FineReader Engine 10 provides three automatic types of document analysis and one manual type:

  • General document analysis
  • Document analysis for invoices 
  • Document analysis for full-text indexing 
  • Manual blocks specification for field-level recognition 
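
The four types listed above could be modelled in application code roughly as follows. This is a sketch only: the enum, the scenario labels, and the helper functions are hypothetical names chosen for illustration and do not come from the FineReader Engine API. The scenario mapping follows the descriptions given on this page.

```python
from enum import Enum


class AnalysisType(Enum):
    """The four analysis types named on this page (illustrative names)."""
    GENERAL = "general"
    INVOICES = "invoices"
    FULL_TEXT_INDEXING = "full_text_indexing"
    MANUAL_BLOCKS = "manual_blocks"


# Hypothetical scenario labels mapped to the analysis type the page associates with them.
SCENARIO_TO_ANALYSIS = {
    "content_reuse": AnalysisType.GENERAL,
    "semi_structured_capture": AnalysisType.INVOICES,
    "full_text_index": AnalysisType.FULL_TEXT_INDEXING,
    "field_level_recognition": AnalysisType.MANUAL_BLOCKS,
}


def choose_analysis(scenario: str) -> AnalysisType:
    """Return the analysis type for a scenario; unknown scenarios get the default (general)."""
    return SCENARIO_TO_ANALYSIS.get(scenario, AnalysisType.GENERAL)


def is_automatic(analysis: AnalysisType) -> bool:
    """Only manual block specification skips automatic analysis."""
    return analysis is not AnalysisType.MANUAL_BLOCKS
```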

 

General document analysis

This is the default document analysis type. It searches for all objects: text blocks, pictures, tables, barcodes, and separators. The results of this analysis are used to retrieve document structure and layout in the content reuse scenario. All pictures and diagrams are preserved in their original form, and the text on them is not recognized.

 

Document analysis for invoices

This type of analysis is a preprocessing step for semi-structured documents such as invoices, payment drafts, bills, waybills, business cards, agreements, and health claim forms. It is designed to accurately locate all the text on these documents, including characters and numbers, even if this information is located within stamps, pictures, logos, or small-text areas.

Unlike the standard full-page (general) document analysis, this type assumes that all printed information on a document is text. It also ensures that important text is not identified as graphic elements and that words and numerical values are not split into separate characters. As a result, maximum information about the text, including its coordinates, is available for analysis, field-by-field processing, and parsing by other systems at subsequent processing stages.

 

Document analysis for full-text indexing

This type of analysis automatically detects and recognizes all text on a document, including text embedded in pictures, charts, and diagrams. Developers can use this mode to extract exhaustive full-text information for building document indexes (for example, in DMS, CMS, and archiving systems).
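
The practical difference between the three automatic types is how they treat text that sits inside pictures, stamps, logos, and diagrams. The lookup below summarizes the behaviour described on this page. It is a simplified model with hypothetical names, not a reflection of how the engine is configured.

```python
# Whether text located inside pictures, stamps, logos, or diagrams is recognized,
# according to the descriptions of the three automatic analysis types.
PICTURE_TEXT_RECOGNIZED = {
    "general": False,            # pictures and diagrams are preserved in original form
    "invoices": True,            # all printed information is assumed to be text
    "full_text_indexing": True,  # text embedded in pictures and charts is recognized
}


def picture_text_is_recognized(analysis_type: str) -> bool:
    """Look up the picture-text behaviour for one of the automatic analysis types."""
    try:
        return PICTURE_TEXT_RECOGNIZED[analysis_type]
    except KeyError:
        raise ValueError(f"unknown analysis type: {analysis_type!r}") from None
```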

 

Manual blocks specification for field-level recognition

This case does not require any analysis, because the recognition field is defined directly by the user or the application. The recognizer receives the coordinates of the field and the type of text it contains, and performs OCR in the specified zone only.
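
A manually specified field amounts to a rectangle plus a text type. The sketch below shows one way an application might represent such a field and check it before handing it to a recognizer. The class, the attribute names, and the "digits" label are assumptions made for illustration. They are not FineReader Engine identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognitionField:
    """A manually specified zone: pixel coordinates plus the expected kind of text."""
    left: int
    top: int
    right: int
    bottom: int
    text_type: str  # hypothetical label, e.g. "digits"

    def check_against_page(self, page_width: int, page_height: int) -> None:
        """Raise ValueError if the zone is empty or extends beyond the page."""
        if self.left >= self.right or self.top >= self.bottom:
            raise ValueError("field has no area")
        if (self.left < 0 or self.top < 0
                or self.right > page_width or self.bottom > page_height):
            raise ValueError("field lies outside the page")


# Example: a field expected to contain an invoice number on an 800 x 1100 pixel page.
invoice_number = RecognitionField(left=400, top=50, right=600, bottom=90, text_type="digits")
invoice_number.check_against_page(page_width=800, page_height=1100)  # valid, so no exception
```

Validating the zone up front keeps coordinate mistakes from surfacing later as empty or misleading recognition results.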

Learn more about full-text and field-level recognition >>

 


All OCR processing stages:

Image Import
Image Processing
Document Analysis
OCR and Other Recognition Technologies
Receiving and Exporting Recognized Text


