Book Archiving

This scenario is used for processing books, magazines, and newspapers to create an electronic library, for instance when digitizing paper book collections to preserve them and to facilitate and expand access to them.

Under this scenario, books, magazines, and newspapers are converted into non-editable digital copies that contain all information from the source in a searchable format. As a result, the digital copies can be found easily in the electronic library using full-text search. The main difference from the Document Archiving scenario is that processing places special emphasis on preserving the quality of the recognized text and restoring the structural elements of the document, especially the content.

To create an electronic copy, image files obtained by scanning or saved in an electronic format first need to go through several processing stages, each of which has its own specifics:

  1. Preprocessing of scanned images

Images obtained by scanning may require some preprocessing prior to recognition. For instance, the image of a scanned book may require straightening of text lines that are skewed near the fold, removal of the dark band along the fold line, and splitting of a double-page spread into two separate pages.
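The spread-splitting step can be illustrated outside the Engine. The Python sketch below is not ABBYY FineReader Engine code. It assumes a non-empty grayscale image held as a list of pixel rows (0 is black, 255 is white) and assumes that the fold appears as the darkest column near the center of the spread. The names find_gutter_column and split_spread are hypothetical and serve illustration only.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of grayscale pixels: 0 = black, 255 = white


def find_gutter_column(image: Image, search_fraction: float = 0.2) -> int:
    """Return the index of the darkest column inside a central band.

    Assumption: the fold line is the darkest vertical strip near the
    middle of the spread. search_fraction is the band width relative
    to the full image width.
    """
    width = len(image[0])
    center = width // 2
    half_band = max(1, int(width * search_fraction / 2))
    start = max(0, center - half_band)
    stop = min(width, center + half_band + 1)
    # Sum brightness per column; the smallest sum is the darkest column.
    column_sums = [sum(row[x] for row in image) for x in range(start, stop)]
    return start + column_sums.index(min(column_sums))


def split_spread(image: Image) -> Tuple[Image, Image]:
    """Split a double-page spread into left and right pages.

    The gutter column itself is dropped from both pages.
    """
    gutter = find_gutter_column(image)
    left = [row[:gutter] for row in image]
    right = [row[gutter + 1:] for row in image]
    return left, right
```

A real scan would usually need a more robust fold detector, for example one that tolerates skew, so the fixed central band here is only a simplifying assumption.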

  2. Recognition of books and newspapers with full restoration of document structure

To extract text data from a document, the document needs to be recognized. When recognizing books and newspapers, restoring the logical structure of the document is especially important. When processing a large volume of documents, simultaneous processing may be useful: during analysis and recognition, the load is spread over the processor cores, which speeds up processing.
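The idea of spreading recognition over processor cores can be sketched in general terms. The Python example below is not the Engine's API: recognize_page is a hypothetical stand-in for a per-page recognition call, and pick_process_count is an assumed helper that caps the number of worker processes by the available cores and by the number of pages.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from typing import List, Optional


def recognize_page(page_path: str) -> str:
    """Hypothetical stand-in for recognizing one page image."""
    return f"recognized:{os.path.basename(page_path)}"


def pick_process_count(page_count: int, requested: Optional[int] = None) -> int:
    """Choose how many worker processes to start.

    Never more than the available cores, never more than the number
    of pages, and always at least one.
    """
    available = os.cpu_count() or 1
    limit = requested if requested else available
    return max(1, min(limit, available, page_count))


def recognize_document(page_paths: List[str],
                       requested_processes: Optional[int] = None) -> List[str]:
    """Recognize all pages in parallel and keep the original page order."""
    workers = pick_process_count(len(page_paths), requested_processes)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of the input pages.
        return list(pool.map(recognize_page, page_paths))


if __name__ == "__main__":
    pages = [f"scan_{n:03d}.tif" for n in range(1, 9)]
    for line in recognize_document(pages):
        print(line)
```

Capping the process count by the page count avoids starting workers that would have nothing to do on short documents.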

  3. Export to an archive format

The recognized document is saved to a format used for storing data. The most convenient formats for storing documents in an electronic library are PDF, PDF/A, and PDF or PDF/A with MRC. When saving to these formats, you can use a mode in which the text is placed underneath the document image. This mode fully preserves the document formatting and enables full-text search. The MRC settings allow a significant reduction of file size without loss of visual quality. Also, when saving to the PDF format, you can customize the security settings of the document to protect it from unauthorized viewing and printing.

Scenario implementation

Below is a detailed description of the recommended method of using ABBYY FineReader Engine 12 in this scenario. The proposed method uses the processing settings that are most suitable for this scenario.

Step 1. Loading ABBYY FineReader Engine

Step 2. Loading settings for the above scenario

Step 3. Loading and preprocessing the images

Step 4. Document recognition

Step 5. Document export

Step 6. Unloading ABBYY FineReader Engine

Required resources

You can use the FREngineDistribution.csv file to automatically create a list of files required for your application to function. For processing with this scenario, select the following values in column 5 (RequiredByModule):

Core

Core.Resources

Opening

Opening, Processing

Processing

Processing.OCR

Processing.OCR, Processing.ICR

Processing.OCR.NaturalLanguages

Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages

Export

Export, Processing

Export.Pdf

Export.Pdf, Opening.Pdf

If you modify the standard scenario, change the required modules accordingly. You also need to specify the interface languages, recognition languages, and any additional features that your application uses (for example, Opening.PDF if you need to open PDF files, or Processing.OCR.CJK if you need to recognize texts in CJK languages). See Working with the FREngineDistribution.csv File for further details.
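The selection described above could be automated with a short script. The Python sketch below rests on several assumptions that are not confirmed by this topic: that the file is comma-delimited with quoted fields, that column 1 holds the file path, and that any header row carries a column 5 value that is not a module name. Only the position of RequiredByModule (column 5) and the module values come from the list above. The function names are hypothetical.

```python
import csv
import io
from typing import Iterable, List, Set

# RequiredByModule values listed for the Book Archiving scenario.
REQUIRED_MODULES: Set[str] = {
    "Core",
    "Core.Resources",
    "Opening",
    "Opening, Processing",
    "Processing",
    "Processing.OCR",
    "Processing.OCR, Processing.ICR",
    "Processing.OCR.NaturalLanguages",
    "Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages",
    "Export",
    "Export, Processing",
    "Export.Pdf",
    "Export.Pdf, Opening.Pdf",
}

MODULE_COLUMN = 4  # column 5 (RequiredByModule), zero-based
FILE_COLUMN = 0    # assumption: column 1 holds the file path


def select_required_files(rows: Iterable[List[str]],
                          modules: Set[str] = REQUIRED_MODULES) -> List[str]:
    """Return file entries whose RequiredByModule value is in modules."""
    selected = []
    for row in rows:
        if len(row) <= MODULE_COLUMN:
            continue  # skip short or empty rows
        if row[MODULE_COLUMN].strip() in modules:
            selected.append(row[FILE_COLUMN].strip())
    return selected


def read_distribution_csv(path: str, delimiter: str = ",") -> List[str]:
    """Read the CSV file and return the files required by the scenario."""
    with open(path, newline="", encoding="utf-8") as handle:
        return select_required_files(csv.reader(handle, delimiter=delimiter))


if __name__ == "__main__":
    # Invented sample rows, for illustration only.
    sample = io.StringIO(
        "File,A,B,C,RequiredByModule\n"
        "example_core.dll,x,x,x,Core\n"
        'example_pdf.dll,x,x,x,"Export.Pdf, Opening.Pdf"\n'
    )
    print(select_required_files(csv.reader(sample)))
```

If the actual file uses a different delimiter or column layout, adjust the delimiter argument and the two column constants. When you modify the scenario, extend REQUIRED_MODULES with the extra values, such as Processing.OCR.CJK.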

Additional optimization for specific tasks

Below is an overview of the Help topics that contain additional information about customizing settings at different stages of document processing:

  • Scanning
    • Scanning
      Description of the ABBYY FineReader Engine scenario for document scanning.
  • Opening and preprocessing
    • Image Preprocessing
      Description of the ABBYY FineReader Engine scenario for preliminary preparation of images or enhancement of their visual quality.
  • Recognition
    • Tuning Parameters of Preprocessing, Analysis, Recognition, and Synthesis
      Customization of document processing using objects of analysis, recognition and synthesis parameters.
    • PageProcessingParams Object
      This object enables the customization of analysis and recognition parameters. Using this object, you can indicate which image and text characteristics must be detected (inverted image, orientation, barcodes, recognition language, recognition error margin).
    • SynthesisParamsForPage Object
      This object includes parameters responsible for restoration of page formatting during synthesis.
    • SynthesisParamsForDocument Object
      This object enables customization of document synthesis: restoration of its structure and formatting.
    • MultiProcessingParams Object
      Simultaneous processing may be useful when processing a large number of images. In this case, the processing load is spread over the processor cores during image opening and preprocessing, layout analysis, recognition, and export, which speeds up processing.
      The processing mode (simultaneous or consecutive) is set using the MultiProcessingMode property. The RecognitionProcessesCount property controls the number of processes that may be started.
  • Export
    • Tuning Export Parameters
      Customization of the document export using objects of export parameters.
    • PDFExportParams Object
      This object allows you to tune PDF (PDF/A) export with just a few parameters.
      To customize the PDF (PDF/A) export mode, use the TextExportMode property of the PDFExportParams object. To customize MRC settings, use the MRCMode property.

See also

Basic Usage Scenarios Implementation

24.03.2023 8:51:52
