Document Analysis

Basic document analysis features

Document Analysis is a set of functions for automatic detection of the following objects on a page:

  • Text blocks
  • Pictures
  • Tables and table cells
  • Barcodes
  • Separators

Document analysis also provides special features that prepare an image for OCR:

  • detection of page orientation (rotation by 90, 180, or 270 degrees)
  • splitting of double pages
  • detection of vertical text in table cells
  • detection and marking of garbage blocks on the page

This preparation is important for specifying which fields on the page should be recognized and which should be kept in their original form.

You can also select the field for recognition manually. In this case, you have to set the field’s coordinates and the type of data it contains. This approach is used in the Field-Level Recognition scenario, mostly for data capture.

ABBYY FineReader Engine 12 provides three automatic types of document analysis and one manual type:

General document analysis

This is the default document analysis type, which searches for all objects: text blocks, pictures, tables, barcodes, and separators. The results of this analysis are used to retrieve the document structure and layout in the content reuse scenario. All pictures and diagrams are preserved in their original form, and the text on them is not recognized.

Document analysis for invoices

This is a preprocessing engine for converting semi-structured documents, such as invoices, payment drafts, bills, waybills, business cards, agreements, health claim forms, and resumes. It has been designed to accurately locate all the text on these documents, including letters and numbers, even if this information is located within stamps, pictures, logos, or small-text areas.

Unlike the standard full-page document analysis, this one assumes that all printed information on documents is text. It also ensures that important text information is not identified as graphic elements and that words or numerical values are not split into separate characters. As a result, maximum information about the text, including its coordinates, is available for analysis, field-by-field processing, and parsing at subsequent processing stages by other systems.

Document analysis for full-text indexing

This type of analysis automatically detects and recognizes all text on documents, including text embedded in pictures, charts, and diagrams. Developers can use this mode to extract the exhaustive full-text information needed for building a document index (as in DMS, CMS, and archiving systems).
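
As an illustration of how an application might choose among the three automatic analysis types described above, here is a minimal Python sketch. The names AnalysisMode and choose_mode and the scenario labels are hypothetical; they are not part of the ABBYY FineReader Engine API and only model the mapping stated in the text.

```python
from enum import Enum


class AnalysisMode(Enum):
    """Hypothetical labels for the three automatic analysis types."""
    GENERAL = "general"
    INVOICES = "invoices"
    FULL_TEXT_INDEXING = "full_text_indexing"


# Hypothetical scenario names mapped to the analysis type the text associates with them.
_MODE_BY_SCENARIO = {
    "content_reuse": AnalysisMode.GENERAL,
    "semi_structured_documents": AnalysisMode.INVOICES,
    "full_text_indexing": AnalysisMode.FULL_TEXT_INDEXING,
}


def choose_mode(scenario: str) -> AnalysisMode:
    """Return the analysis type for a scenario.

    Unknown scenarios fall back to GENERAL, because the text
    describes general document analysis as the default type.
    """
    return _MODE_BY_SCENARIO.get(scenario, AnalysisMode.GENERAL)


mode = choose_mode("semi_structured_documents")
```

Falling back to the general type for unrecognized scenarios is a design assumption of this sketch, not documented behavior.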

Manual blocks specification for field-level recognition

This case does not need any analysis because the recognition field is defined directly by the user or the application. The recognizer receives the coordinates of the field and the type of text, and performs OCR in the specified zone.
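
The field definition described above (coordinates plus a type of text) can be sketched as a small data structure that an application validates before passing it on for recognition. This is a minimal sketch under stated assumptions: FieldZone, its attribute names, the pixel units, and the text type labels are hypothetical and do not come from the ABBYY FineReader Engine API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldZone:
    """Hypothetical description of one manually specified recognition field."""
    left: int    # coordinates are assumed to be in pixels
    top: int
    right: int
    bottom: int
    text_type: str  # hypothetical label, for example "digits" or "text"

    def validate(self, page_width: int, page_height: int) -> None:
        """Raise ValueError if the zone is empty or falls outside the page."""
        if not (0 <= self.left < self.right <= page_width):
            raise ValueError("horizontal bounds are outside the page or empty")
        if not (0 <= self.top < self.bottom <= page_height):
            raise ValueError("vertical bounds are outside the page or empty")
        if not self.text_type:
            raise ValueError("text_type must not be empty")


# Example: a field holding digits, checked against an assumed 2480 x 3508 pixel page.
zone = FieldZone(left=100, top=50, right=400, bottom=90, text_type="digits")
zone.validate(page_width=2480, page_height=3508)  # passes silently
```

Checking the zone against the page size up front is a design choice of this sketch: it catches a bad field definition before any recognition is attempted.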

See also

Key Features

24.03.2023 8:51:52
