Document Conversion
The result of this scenario is an editable version of a document.
In this scenario, document images are recognized with all the original formatting retained, and the data are saved to an editable file format. As a result, you get editable versions of your documents, which can be easily checked for errors and modified. You can also copy all or some of the text for reuse.
A document goes through several processing steps, which differ slightly from those of the other common scenarios:
- Preprocessing of scanned images or photos
Images obtained from a scanner or a digital camera may need some correction before they can be optically recognized. For example, noisy images or images with distorted text lines must be corrected for optical recognition to succeed.
- Recognition with full restoration of document structure and formatting
When a document is recognized, its various layout elements (text, tables, images, separators, etc.) are identified. During document synthesis, the logical structure of the document is restored, while page synthesis fully restores the document formatting (fonts, styles, etc.).
- Export to an editable format
The recognized document is saved to an editable format, such as RTF or DOCX.
Scenario implementation
Below you will find a detailed description of the recommended method of using ABBYY FineReader Engine 12 to convert documents. The proposed method uses the processing settings that are most suitable for this purpose.
Step 1. Loading ABBYY FineReader Engine
Step 2. Loading settings for the scenario
Step 3. Loading and preprocessing the images
Step 4. Document recognition
Step 5. Document export
Step 6. Unloading ABBYY FineReader Engine
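The order of Steps 2 to 5 can be sketched as a single function. This is a minimal illustration, not the reference implementation: the method names (LoadPredefinedProfile, CreateFRDocument, AddImageFile, Process, Export, Close), the profile name "DocumentConversion_Accuracy", and the use of None for optional arguments are assumptions that should be checked against the API reference. Loading and unloading the Engine (Steps 1 and 6) and the export format constant are left to the caller. The stub classes are hypothetical and exist only so the sketch runs without the Engine installed.

```python
def convert_document(engine, image_paths, output_path, export_format,
                     profile="DocumentConversion_Accuracy"):
    """Run Steps 2-5 on an already loaded engine object.

    All method names called here are assumptions about the Engine API.
    """
    engine.LoadPredefinedProfile(profile)              # Step 2: scenario settings
    document = engine.CreateFRDocument()
    try:
        for path in image_paths:                       # Step 3: load the images
            document.AddImageFile(path, None, None)
        document.Process(None)                         # Step 4: recognition
        document.Export(output_path, export_format, None)  # Step 5: export
    finally:
        document.Close()                               # release the document


class StubDocument:
    """Hypothetical stand-in that only records the calls it receives."""

    def __init__(self, log):
        self.log = log

    def AddImageFile(self, path, *optional_args):
        self.log.append(("AddImageFile", path))

    def Process(self, params):
        self.log.append(("Process",))

    def Export(self, path, export_format, params):
        self.log.append(("Export", path, export_format))

    def Close(self):
        self.log.append(("Close",))


class StubEngine:
    """Hypothetical stand-in for a loaded Engine object."""

    def __init__(self):
        self.log = []

    def LoadPredefinedProfile(self, name):
        self.log.append(("LoadPredefinedProfile", name))

    def CreateFRDocument(self):
        self.log.append(("CreateFRDocument",))
        return StubDocument(self.log)


if __name__ == "__main__":
    stub = StubEngine()
    # "FEF_RTF" is a placeholder string here, not the real export constant.
    convert_document(stub, ["page1.tif", "page2.tif"], "result.rtf", "FEF_RTF")
    for call in stub.log:
        print(call)
```

Passing the engine in as a parameter keeps the sketch independent of how the Engine is loaded, which differs between platforms and loading methods.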
Required resources
You can use the FREngineDistribution.csv file to automatically create a list of the files your application requires. For processing with this scenario, select the following values in column 5 (RequiredByModule):
Core
Core.Resources
Opening
Opening, Processing
Processing
Processing.OCR
Processing.OCR, Processing.ICR
Processing.OCR.NaturalLanguages
Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages
Export
Export, Processing
If you modify the standard scenario, change the required modules accordingly. You also need to specify the interface languages, recognition languages, and any additional features that your application uses (for example, Opening.PDF if you need to open PDF files, or Processing.OCR.CJK if you need to recognize texts in CJK languages). See Working with the FREngineDistribution.csv File for further details.
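The selection by RequiredByModule can be automated with a short script. The following is a minimal sketch that assumes the file is comma-delimited, that values containing commas are quoted, and that RequiredByModule is column 5 as stated above. The sample rows and file paths are invented for illustration and do not reflect the real contents of FREngineDistribution.csv.

```python
import csv
import io

# Values of column 5 (RequiredByModule) listed for this scenario.
REQUIRED_MODULES = {
    "Core",
    "Core.Resources",
    "Opening",
    "Opening, Processing",
    "Processing",
    "Processing.OCR",
    "Processing.OCR, Processing.ICR",
    "Processing.OCR.NaturalLanguages",
    "Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages",
    "Export",
    "Export, Processing",
}


def select_required_rows(csv_text, modules=REQUIRED_MODULES,
                         module_column=4, delimiter=","):
    """Return the rows whose RequiredByModule cell is one of `modules`.

    module_column is zero-based, so 4 means column 5. The delimiter is
    an assumption; change it if the real file uses a different one.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    return [
        row for row in reader
        if len(row) > module_column and row[module_column].strip() in modules
    ]


# Invented sample data: only the position of the fifth column matters.
SAMPLE = (
    "Path,A,B,C,RequiredByModule\n"
    "Bin/Example1.dll,,,,Core\n"
    "Bin/Example2.dll,,,,Opening.PDF\n"
    'Bin/Example3.dll,,,,"Processing.OCR, Processing.ICR"\n'
)

if __name__ == "__main__":
    for row in select_required_rows(SAMPLE):
        print(row[0])
```

To include additional features such as Opening.PDF, pass an extended set as the modules argument instead of editing REQUIRED_MODULES.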
Additional optimization for specific tasks
Below is an overview of the Help topics containing additional information about customizing settings at different stages of document conversion to an editable format:
- Opening and preprocessing
  - Image Preprocessing
  Description of the ABBYY FineReader Engine scenario for preliminary preparation of images or enhancement of their visual quality.
- Recognition
  - Tuning Parameters of Preprocessing, Analysis, Recognition, and Synthesis
  Customization of document processing using objects of analysis, recognition and synthesis parameters.
  - PageProcessingParams Object
  This object enables the customization of analysis and recognition parameters. Using this object, you can indicate which image and text characteristics must be detected (inverted image, orientation, barcodes, recognition language, recognition error margin).
  - SynthesisParamsForPage Object
  This object includes parameters responsible for restoring page formatting during synthesis.
  - SynthesisParamsForDocument Object
  This object enables customization of the document synthesis: restoration of its structure and formatting.
  - MultiProcessingParams Object
  Simultaneous processing may be useful when processing a large number of images. In this case, the processing load is spread over the processor cores during image opening and preprocessing, layout analysis, recognition, and export, which speeds up processing.
  Reading modes (simultaneous or consecutive) are set using the MultiProcessingMode property. The RecognitionProcessesCount property controls the number of processes that may be started.
- Export
  - Tuning Export Parameters
  Customization of the document export using objects of export parameters.
  - RTFExportParams Object
  This object enables customization of the RTF/DOCX/ODT saving format parameters.
  - HTMLExportParams Object
  This object allows customization of export to the HTML format.
  - PPTExportParams Object
  This object enables customization of the PPTX saving format parameters.
03.07.2024 8:50:25