Document Archiving

This scenario is used for processing paper documents to save them into a digital archive, especially when creating an archive of contracts, project documentation, invoices, certificates, etc.

In this processing scenario, paper documents are converted into non-editable digital copies containing all the document information in a searchable format. As a result of such processing, digital copies of documents may be easily found in an electronic archive using full-text search, document text segments may be copied, and documents may be sent by e-mail or printed out.

To create a digital copy, the document first needs to go through several processing stages, each of which has its own peculiarities:

Preprocessing of scanned images

Scanned images may require preprocessing before recognition, for example, if they contain background noise, skewed text, inverted colors, or black margins, or if they have the wrong orientation or resolution.

Simultaneous recognition of a large volume of documents

To extract text data from a document, the document must be recognized. When processing a large volume of documents, it can be useful to process several documents simultaneously. In this case, the analysis and recognition workload can be spread across the processor cores, which speeds up processing.
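As a rough illustration of spreading work across cores, the following Python sketch distributes a batch of files over a pool of worker processes. It is a generic pattern, not the FineReader Engine mechanism: `recognize_file` is a hypothetical placeholder for whatever per-file recognition call the application makes, and `recognize_batch` and its parameters are names introduced here for illustration only.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def recognize_file(path):
    """Hypothetical placeholder for the real per-file recognition call."""
    return f"recognized text of {path}"


def recognize_batch(paths, recognize=recognize_file, max_workers=None,
                    executor_cls=ProcessPoolExecutor):
    """Run `recognize` on every path in parallel and return {path: result}."""
    paths = list(paths)
    # Default to one worker per processor core.
    workers = max_workers or os.cpu_count() or 1
    with executor_cls(max_workers=workers) as pool:
        # Executor.map yields results in input order, so zip pairs
        # each path with its own result.
        return dict(zip(paths, pool.map(recognize, paths)))
```

With the default `ProcessPoolExecutor`, call `recognize_batch` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Passing `executor_cls=ThreadPoolExecutor` runs the same code on threads, which may be enough when the recognition call releases the interpreter lock.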

Export to an archive format

The recognized document is saved to a suitable storage format. The most convenient formats for storing documents are PDF and PDF/A, with or without MRC. When saving to these formats, you can use a mode in which the text is placed underneath the document image; this fully preserves the document formatting and enables full-text search. The MRC settings allow a significant reduction of file size without loss of visual quality. Also, when saving to PDF, you can customize the document's security settings to protect it from unauthorized viewing and printing.

Scenario implementation

Below is a detailed description of the recommended method of using ABBYY FineReader Engine 12 to create digital copies of documents for archiving. The proposed method uses the processing settings that are most suitable for this purpose. In this implementation, the document scanning phase is omitted. See Additional optimization for specific tasks below for tips on implementing scanning.

Step 1. Loading ABBYY FineReader Engine

Step 2. Loading settings for the scenario

Profile name	Description
DocumentArchiving_Accuracy	The settings are optimized for accuracy. Enables detection of as much text as possible on an image, including text embedded in the image. Full synthesis of the logical structure of the document is not performed. Important! The profile is not intended for converting a document to RTF, DOCX, or text-only PDF. Use the document conversion profiles for that purpose.
DocumentArchiving_Speed	The settings are optimized for processing speed. Enables detection of as much text as possible on an image, including text embedded in the image. Skew correction is not performed. Full synthesis of the logical structure of the document is not performed. Document analysis and recognition are sped up. Important! The profile is not intended for converting a document to RTF, DOCX, or text-only PDF. Use the document conversion profiles for that purpose.

Step 3. Loading and preprocessing the images

Step 4. Document recognition

Step 5. Document export

Step 6. Unloading ABBYY FineReader Engine
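Taken together, the six steps form a fixed sequence in which the engine should be unloaded even if an intermediate step fails. The Python sketch below shows only that ordering. `ArchivingEngine` and all of its method names are hypothetical adapter names invented for this illustration; they are not the FineReader Engine API, and a real application would implement them on top of the actual engine calls. The only name taken from this page is the profile name.

```python
from typing import Protocol, Sequence


class ArchivingEngine(Protocol):
    """Hypothetical adapter; these method names are NOT the FineReader Engine API."""

    def load(self) -> None: ...
    def load_profile(self, name: str) -> None: ...
    def open_images(self, paths: Sequence[str]) -> None: ...
    def recognize(self) -> None: ...
    def export_pdf(self, out_path: str) -> None: ...
    def unload(self) -> None: ...


def archive_document(engine: ArchivingEngine,
                     image_paths: Sequence[str],
                     out_path: str,
                     profile: str = "DocumentArchiving_Accuracy") -> None:
    """Run the six scenario steps in order for one document."""
    engine.load()                          # Step 1: load the engine
    try:
        engine.load_profile(profile)       # Step 2: load the scenario settings
        engine.open_images(image_paths)    # Step 3: load and preprocess images
        engine.recognize()                 # Step 4: recognize the document
        engine.export_pdf(out_path)        # Step 5: export to the archive format
    finally:
        engine.unload()                    # Step 6: runs even if a step above fails
```

The `try`/`finally` block is the main design point: once loading has succeeded, unloading is guaranteed, so a failed recognition or export does not leave the engine loaded.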

Required resources

You can use the FREngineDistribution.csv file to automatically create a list of the files that your application requires. For processing with this scenario, select the following values in column 5 (RequiredByModule):

Core

Core.Resources

Opening

Opening, Processing

Processing

Processing.OCR

Processing.OCR, Processing.ICR

Processing.OCR.NaturalLanguages

Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages

Export

Export, Processing

Export.Pdf

Export.Pdf, Opening.Pdf

If you modify the standard scenario, change the required modules accordingly. You also need to specify the interface languages, the recognition languages, and any additional features that your application uses (for example, Opening.PDF if you need to open PDF files, or Processing.OCR.CJK if you need to recognize texts in CJK languages). See Working with the FREngineDistribution.csv File for further details.
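The selection of rows by column 5 can be automated. The Python sketch below filters CSV text by the RequiredByModule values listed above. It assumes a comma-delimited file in which multi-module values such as "Opening, Processing" are quoted cells; the actual delimiter and the meaning of the other columns are not described here, so the function returns whole rows and exposes the delimiter as a parameter. The function and constant names are illustrative.

```python
import csv
import io

# RequiredByModule values for the standard Document Archiving scenario.
REQUIRED_MODULES = {
    "Core",
    "Core.Resources",
    "Opening",
    "Opening, Processing",
    "Processing",
    "Processing.OCR",
    "Processing.OCR, Processing.ICR",
    "Processing.OCR.NaturalLanguages",
    "Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages",
    "Export",
    "Export, Processing",
    "Export.Pdf",
    "Export.Pdf, Opening.Pdf",
}


def select_required_rows(csv_text, modules=REQUIRED_MODULES,
                         module_column=5, delimiter=","):
    """Return rows whose 1-based `module_column` cell exactly matches a module value."""
    index = module_column - 1
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    # Rows that are too short to have the column are skipped.
    return [row for row in reader
            if len(row) > index and row[index].strip() in modules]
```

To cover a modified scenario, pass an extended set, for example `REQUIRED_MODULES | {"Processing.OCR.CJK"}`.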

Additional optimization for specific tasks

Below is an overview of the Help topics that contain additional information on customizing settings at different stages of document processing:

Scanning

Tips for Document Scanning
Getting quality images from scanning paper documents.

Opening and preprocessing

Image Preprocessing
Description of the ABBYY FineReader Engine scenario for preliminary preparation of images and enhancement of their visual quality.

Recognition

Tuning Parameters of Preprocessing, Analysis, Recognition, and Synthesis
Customization of document processing using objects of analysis, recognition and synthesis parameters.
PageProcessingParams Object
This object enables customization of analysis and recognition parameters. Using this object, you can indicate which image and text characteristics must be detected (inverted image, orientation, bar codes, recognition language, recognition error margin).
SynthesisParamsForPage Object
This object includes parameters responsible for restoring page formatting during synthesis.
SynthesisParamsForDocument Object
This object enables customization of the document synthesis: restoration of its structure and formatting.

Export

Tuning Export Parameters
Customization of document export using objects of export parameters.
PDFExportParams Object
This object allows you to tune PDF (PDF/A) export using just a few parameters.
To customize the PDF (PDF/A) export mode, use the TextExportMode property of the PDFExportParams object. To customize MRC settings, use the MRCMode property.
In addition, you can customize image export settings to speed up processing, further reduce file size, and so on. For example, you can save a color image as a grayscale or black-and-white image if this fits your scenario (use the Colority property of the PDFExportParams object).
You can change the image resolution so that the resulting electronic copy is suitable for printing or for viewing on a computer screen, or you can select a low resolution that keeps the text readable but renders graphics at very poor quality (use the Resolution and ResolutionType properties of the PDFExportParams object).

Separation into documents

Under this scenario, the batch of images may have to be separated into documents. ABBYY FineReader Engine 12 does not support automatic document separation. However, you can use ABBYY FlexiCapture Engine to implement automatic separation. The documents may be separated, for instance, by a fixed number of pages per document or by pages that carry separator barcodes. When implementing barcode separation, you can use the scenario that extracts only barcode values from the document.
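The two separation strategies mentioned above can be sketched in a few lines of Python. This is application-side logic only and does not use any ABBYY API: pages are represented by arbitrary objects, and `is_separator` is a hypothetical callback that would, in practice, check the barcode value extracted from a page. The sketch assumes that separator pages are dropped from the output rather than kept as the first page of the following document.

```python
def split_by_page_count(pages, pages_per_document):
    """Cut the batch into consecutive documents of a fixed number of pages."""
    if pages_per_document < 1:
        raise ValueError("pages_per_document must be at least 1")
    return [pages[i:i + pages_per_document]
            for i in range(0, len(pages), pages_per_document)]


def split_on_separators(pages, is_separator):
    """Start a new document after each separator page; separators are dropped."""
    documents, current = [], []
    for page in pages:
        if is_separator(page):
            # Close the current document; consecutive separators
            # do not produce empty documents.
            if current:
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents
```

With fixed-count splitting, the last document holds whatever pages remain, so it may be shorter than the others.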
