Data Extraction

This scenario is used to extract all possible data from a document and store it in a structured way.

The result is a JSON file which represents the document structure. It stores all document objects: printed and handwritten text, tables, barcodes, checkmarks, and images with their location and attributes. This format is optimal for further processing, storing data in a database, or integrating with another application.
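As an illustration of further processing, the sketch below flattens a parsed JSON tree into (path, value) pairs that could be written to database rows. It deliberately makes no assumption about the schema of the exported file. The function name `iter_leaves` and the field names in the sample (`pages`, `blocks`, `kind`, `value`) are hypothetical placeholders, not the actual export format.

```python
import json
from typing import Any, Iterator, Tuple


def iter_leaves(node: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every scalar value in a parsed JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_leaves(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_leaves(value, f"{path}/{index}")
    else:
        # Scalars (str, int, float, bool, None) end the recursion.
        yield path, node


# Hypothetical sample: the field names are placeholders, not the real schema.
sample = json.loads(
    '{"pages": [{"blocks": [{"kind": "text", "value": "Invoice"}]}]}'
)
rows = list(iter_leaves(sample))
```

To inspect a real export, load the file with `json.load` and pass the result to `iter_leaves`; the paths it yields show which keys the file actually contains.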

A document goes through several processing steps in this scenario:

  1. Preprocessing of scanned images or photos

Images obtained from a scanner or a digital camera may need correction before they can be optically recognized. For example, noisy images or images with distorted text lines must be corrected for optical recognition to succeed.

  2. Extracting all the data on the document in a structured way

During layout analysis, objects on the image are detected and assigned to blocks of the corresponding type. Each block is recognized with the settings that are optimal for its type. During synthesis, the logical structure of the document is restored in a consistent manner. Even for complex layouts, the text order is preserved so that it follows the way a human would read the page. This ensures that re-recognition of the same document results in the same text order.

  3. Export to a structured format

The recognized document is saved to JSON or XML.

Scenario implementation

Below is a detailed description of the recommended method of using ABBYY FineReader Engine 12 to extract data from documents. The proposed method uses the processing settings that are most suitable for this purpose.

Step 1. Loading ABBYY FineReader Engine

Step 2. Loading settings for the scenario

Step 3. Loading and preprocessing the images

Step 4. Document recognition

Step 5. Document export

Step 6. Unloading ABBYY FineReader Engine

Required resources

You can use the FREngineDistribution.csv file to automatically create a list of the files your application requires. For processing with this scenario, select the following values in column 5 (RequiredByModule):

Core

Core.Resources

Opening

Opening, Processing

Processing

Processing.OCR

Processing.OCR, Processing.ICR

Processing.OCR.NaturalLanguages

Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages

Export

Export, Processing

If you modify the standard scenario, change the required modules accordingly. You also need to specify the interface languages, recognition languages, and any additional features that your application uses (for example, Opening.PDF if you need to open PDF files, or Processing.OCR.CJK if you need to recognize texts in CJK languages). See Working with the FREngineDistribution.csv File for further details.
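The selection by column 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of the actual file layout: it assumes that the first column holds the file path, that fields are comma-separated, and that values containing commas (such as "Opening, Processing") are quoted. The function name `select_files` is hypothetical; adjust `delimiter` and the column indices to match your copy of FREngineDistribution.csv.

```python
import csv
import io
from typing import Iterable, List, Set

# Values of column 5 (RequiredByModule) listed for this scenario.
REQUIRED_MODULES: Set[str] = {
    "Core",
    "Core.Resources",
    "Opening",
    "Opening, Processing",
    "Processing",
    "Processing.OCR",
    "Processing.OCR, Processing.ICR",
    "Processing.OCR.NaturalLanguages",
    "Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages",
    "Export",
    "Export, Processing",
}


def select_files(
    lines: Iterable[str], modules: Set[str], delimiter: str = ","
) -> List[str]:
    """Return column 1 of every row whose column 5 is one of `modules`.

    Assumption: column 1 is the file path. Rows with fewer than five
    columns, and any header row, simply fail to match and are skipped.
    """
    selected = []
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) >= 5 and row[4].strip() in modules:
            selected.append(row[0].strip())
    return selected


# Made-up sample rows for illustration only; not taken from the real file.
sample = io.StringIO(
    "Path,Col2,Col3,Col4,RequiredByModule\n"
    "Bin/a.dll,,,,Core\n"
    'Bin/b.dll,,,,"Opening, Processing"\n'
    "Bin/c.dll,,,,Opening.PDF\n"
)
files = select_files(sample, REQUIRED_MODULES)
```

To process the real file, pass an open file object instead of the sample, for example `open("FREngineDistribution.csv", newline="")`, and extend `REQUIRED_MODULES` with any additional modules your application uses.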

Additional optimization for specific tasks

Below is an overview of the Help topics that contain additional information on customizing settings at different stages of document processing:

  • Scanning
    • Scanning
      Description of the ABBYY FineReader Engine scenario for document scanning.
  • Recognition
    • Tuning Parameters of Preprocessing, Analysis, Recognition, and Synthesis
      Customization of document processing using objects of analysis, recognition and synthesis parameters.
    • PageProcessingParams Object
      This object enables the customization of analysis and recognition parameters. Using this object, you can indicate which image and text characteristics must be detected (inverted image, orientation, barcodes, recognition language, recognition error margin).
    • SynthesisParamsForPage Object
      This object includes parameters responsible for restoring page formatting during synthesis.
    • SynthesisParamsForDocument Object
      This object enables customization of the document synthesis: restoration of its structure and formatting.
    • MultiProcessingParams Object
      Simultaneous processing may be useful when processing a large number of images. In this case, the processing load will be spread over the processor cores during image opening and preprocessing, layout analysis, recognition, and export, which makes it possible to speed up processing.
      Reading modes (simultaneous or consecutive) are set using the MultiProcessingMode property. The RecognitionProcessesCount property controls the number of processes that may be started.
  • Export

See also

Basic Usage Scenarios Implementation

07.11.2025 12:48:30
