Basic document analysis features
Document Analysis is a set of functions for automatic detection of the following objects on a page:
- Text blocks
- Tables and table cells
Additionally, document analysis provides several special features that prepare an image for OCR:
- detection of page orientation (90, 180, or 270 degrees)
- splitting of double pages
- detection of vertical text in table cells
- detection and marking of garbage blocks on the page
This preparation is important for specifying which areas of the page should be recognized and which should be kept in their original form.
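As an illustration of what orientation detection feeds into, here is a minimal sketch in Python of the correction step that follows it. The function names are hypothetical and are not part of the FineReader Engine API. The sketch also assumes that the detected value is the clockwise rotation of the page content relative to upright; the engine's own convention may differ.

```python
ALLOWED_ROTATIONS = (0, 90, 180, 270)


def corrective_rotation(detected_degrees: int) -> int:
    """Return the clockwise rotation, in degrees, that brings the page upright.

    Assumption: detected_degrees is the clockwise rotation of the page
    content relative to its upright position.
    """
    if detected_degrees not in ALLOWED_ROTATIONS:
        raise ValueError(f"unsupported orientation: {detected_degrees}")
    return (360 - detected_degrees) % 360


def upright_size(width: int, height: int, detected_degrees: int) -> tuple:
    """Return (width, height) of the image after the corrective rotation."""
    corrective_rotation(detected_degrees)  # validates the input
    if detected_degrees in (90, 270):
        # A quarter turn swaps the two dimensions.
        return height, width
    return width, height
```

Field coordinates defined on the upright page are only meaningful after this correction, which is one reason orientation handling comes before recognition.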
The field for recognition can also be selected manually. In this case, you have to set the field’s coordinates and the type of data it contains. This approach is used in the Field-Level Recognition scenario, mostly for data capture.
ABBYY FineReader Engine 12 provides three automatic types of document analysis and one manual type:
- General document analysis
- Document analysis for invoices
- Document analysis for full-text indexing
- Manual blocks specification for field-level recognition
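The choice among these four types depends on the processing scenario. The following Python sketch shows one way an application might encode that choice. The enum, the scenario keys, and the helper functions are hypothetical names used for illustration only; they are not FineReader Engine identifiers.

```python
from enum import Enum


class AnalysisMode(Enum):
    """The four document analysis types listed above (hypothetical names)."""
    GENERAL = "general"
    INVOICES = "invoices"
    FULL_TEXT_INDEXING = "full_text_indexing"
    MANUAL_FIELDS = "manual_fields"


# Hypothetical scenario keys mapped to the analysis type that serves them.
_SCENARIO_TO_MODE = {
    "content_reuse": AnalysisMode.GENERAL,
    "semi_structured_extraction": AnalysisMode.INVOICES,
    "full_text_indexing": AnalysisMode.FULL_TEXT_INDEXING,
    "field_level_recognition": AnalysisMode.MANUAL_FIELDS,
}


def choose_mode(scenario: str) -> AnalysisMode:
    """Return the analysis type for a scenario, or raise ValueError."""
    try:
        return _SCENARIO_TO_MODE[scenario]
    except KeyError:
        raise ValueError(f"unknown scenario: {scenario!r}") from None


def needs_automatic_analysis(mode: AnalysisMode) -> bool:
    """Manual field specification skips automatic analysis entirely."""
    return mode is not AnalysisMode.MANUAL_FIELDS
```

Keeping the mapping in one table makes the default explicit and lets an unknown scenario fail loudly instead of silently falling back to general analysis.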
General document analysis
This is the default document analysis type, which searches for all objects: text blocks, pictures, tables, barcodes, and separators. The results of this analysis are used to retrieve the document structure and layout in the content reuse scenario. All pictures and diagrams are preserved in their original form, without recognizing the text on them.
Document analysis for invoices
This type is a preprocessing step for converting semi-structured documents such as invoices, payment drafts, bills, waybills, business cards, agreements, health claim forms, and resumes. It is designed to accurately locate all the text on these documents, including characters and numbers, even if this information is located within stamps, pictures, logos, or small-text areas.
Unlike the general full-page document analysis, this type assumes that all printed information on a document is text. It also ensures that important text is not identified as a graphic element and that words and numerical values are not split into separate characters. As a result, the maximum amount of information about the text, including its coordinates, is available for analysis, field-by-field processing, and parsing by other systems at subsequent processing stages.
Document analysis for full-text indexing
This type automatically detects and recognizes all text on a document, including text embedded in pictures, charts, and diagrams. Developers can use this mode to extract the exhaustive full-text information needed to build document indexes, for example in document management (DMS), content management (CMS), and archiving systems.
Manual blocks specification for field-level recognition
This type does not need any analysis because the recognition field is defined directly by the user or the application. The recognizer receives the coordinates of the field and the type of text, and performs OCR in the specified zone.
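A field definition of this kind can be sketched as a small data structure. The Python example below is illustrative only: `FieldSpec`, `TextType`, and `build_request` are hypothetical names, not FineReader Engine API, and the sketch assumes pixel coordinates with the origin at the top-left corner of the page.

```python
from dataclasses import dataclass
from enum import Enum


class TextType(Enum):
    """Hypothetical kinds of data a field may contain."""
    PRINTED = "printed"
    HANDPRINTED = "handprinted"
    DIGITS = "digits"


@dataclass(frozen=True)
class FieldSpec:
    """A recognition zone: rectangle in pixels plus the expected text type."""
    left: int
    top: int
    right: int
    bottom: int
    text_type: TextType

    def validate(self, page_width: int, page_height: int) -> None:
        """Raise ValueError if the rectangle is empty or leaves the page."""
        if not (0 <= self.left < self.right <= page_width):
            raise ValueError(f"bad horizontal extent: {self.left}..{self.right}")
        if not (0 <= self.top < self.bottom <= page_height):
            raise ValueError(f"bad vertical extent: {self.top}..{self.bottom}")


def build_request(fields, page_width: int, page_height: int) -> list:
    """Validate every field and return plain dicts for a recognizer wrapper."""
    request = []
    for field in fields:
        field.validate(page_width, page_height)
        request.append({
            "rect": (field.left, field.top, field.right, field.bottom),
            "text_type": field.text_type.value,
        })
    return request
```

For example, `build_request([FieldSpec(100, 50, 400, 90, TextType.DIGITS)], 2480, 3508)` describes a single numeric field on a page of 2480 by 3508 pixels. Validating the rectangle up front catches zones that fall outside the page before any recognition is attempted.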