Document Comparison
When working with paper documents, you often need to find and correct mistakes or intentional changes. Use the Document Comparison API to find such changes quickly and efficiently.
This scenario is used to compare documents of special importance, such as contracts and bank documentation, with their copies. The comparison result describes each difference: the type of content (text only), the kind of modification (deleted, inserted, or modified), and its location in the original and in the copy. You can get the list of detected differences or the region of any change, and save the comparison result to an external file for further processing or long-term storage.
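To make the shape of such a result concrete, here is a minimal sketch in Python. It does not use ABBYY FineReader Engine: it runs the standard-library difflib module on two plain strings and reports each difference with its kind and its word positions in the original and in the copy. The Change class, its fields, and the compare_texts function are hypothetical names chosen for illustration and are not part of the Engine API.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Tuple


@dataclass
class Change:
    """One detected difference (hypothetical structure, not the Engine's)."""
    kind: str                       # "deleted", "inserted", or "modified"
    original_span: Tuple[int, int]  # word index range [start, end) in the original
    copy_span: Tuple[int, int]      # word index range [start, end) in the copy
    original_text: str
    copy_text: str


# Map difflib opcode tags to the three kinds of modification named in the text.
_KINDS = {"delete": "deleted", "insert": "inserted", "replace": "modified"}


def compare_texts(original: str, copy: str) -> List[Change]:
    """Compare two texts word by word and list the differences."""
    a, b = original.split(), copy.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged run of words, nothing to report
        changes.append(
            Change(
                kind=_KINDS[tag],
                original_span=(i1, i2),
                copy_span=(j1, j2),
                original_text=" ".join(a[i1:i2]),
                copy_text=" ".join(b[j1:j2]),
            )
        )
    return changes


if __name__ == "__main__":
    for change in compare_texts(
        "The fee is 100 USD payable monthly",
        "The fee is 150 USD payable",
    ):
        print(change)
```

In a real application the texts would come from recognized documents, and the locations would refer to regions on a page instead of word indexes.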
Before documents or pages can be compared, files obtained by scanning or saved in an electronic format typically go through several processing stages, each with its own peculiarities:
- Preprocessing of scanned files or images
The files and their copies require some preprocessing prior to recognition if they contain defects or intentionally added marks, such as signatures or stamps.
- Recognition with full restoration of document structure and formatting
When a document is recognized, its various layout elements (text, tables, images, separators, etc.) are identified. During document synthesis, the logical structure of the document is restored, while page synthesis fully restores the document formatting (fonts, styles, etc.).
- Comparison of documents or pages
To compare documents or pages with their copies, use files that were recognized with ABBYY FineReader Engine. The two versions of a document may be in different formats. The comparison returns a result with the list of changes; use it to retrieve the location of each change. If you use manual verification, use this information to highlight the changes in the text, which makes the operator's job easier.
- Export to an external format
You can also save the comparison result in XML or DOCX format.
The procedure described below is also illustrated by a Document Comparison sample.
Scenario implementation
The recommended method of using ABBYY FineReader Engine in this scenario consists of the following steps:
Step 1. Loading ABBYY FineReader Engine
Step 2. Loading and preprocessing the files and images
Step 3. Document recognition
Step 4. Comparing the documents or pages
Step 5. Working with the detected changes
Step 6. Exporting the comparison result
Step 7. Unloading ABBYY FineReader Engine
Required resources
You can use the FREngineDistribution.csv file to automatically create a list of files required for your application to function. For processing with this scenario, select the following values in column 5 (RequiredByModule):
Core
Core.Resources
Opening
Opening, Processing
Processing
Processing.OCR
Processing.OCR, Processing.ICR
Processing.OCR.NaturalLanguages
Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages
Export
Export, Processing
If you modify the standard scenario, change the required modules accordingly. You also need to specify the interface languages, recognition languages, and any additional features that your application uses (for example, Opening.PDF if you need to open PDF files, or Processing.OCR.CJK if you need to recognize texts in CJK languages). See Working with the FREngineDistribution.csv File for further details.
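The selection described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a description of the file's actual layout: it assumes that the delimiter is a comma (it is a parameter, so adjust it if your file differs) and that values containing commas, such as "Opening, Processing", are quoted so that the csv module reads them as a single field. The function names are hypothetical.

```python
import csv
from typing import AbstractSet, Iterable, List

# Values of column 5 (RequiredByModule) listed above for the standard scenario.
REQUIRED_MODULES = frozenset({
    "Core",
    "Core.Resources",
    "Opening",
    "Opening, Processing",
    "Processing",
    "Processing.OCR",
    "Processing.OCR, Processing.ICR",
    "Processing.OCR.NaturalLanguages",
    "Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages",
    "Export",
    "Export, Processing",
})


def select_required_rows(
    rows: Iterable[List[str]],
    wanted: AbstractSet[str] = REQUIRED_MODULES,
    column: int = 5,
) -> List[List[str]]:
    """Keep the rows whose 1-based `column` holds one of the wanted values."""
    index = column - 1
    return [
        row for row in rows
        if len(row) > index and row[index].strip() in wanted
    ]


def load_required_rows(csv_path: str, delimiter: str = ",") -> List[List[str]]:
    """Read the CSV file and return only the rows required by the scenario.

    The delimiter and the encoding are assumptions; check them against your file.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return select_required_rows(csv.reader(handle, delimiter=delimiter))
```

If the scenario is modified, pass a different set as `wanted`, for example `REQUIRED_MODULES | {"Opening.PDF"}`.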
Additional optimization for specific tasks
Below is an overview of the Help topics that contain additional information about customizing settings at different processing stages:
- Opening and preprocessing
  - Image Preprocessing
    Description of the ABBYY FineReader Engine scenario for preliminary preparation of images or enhancement of their visual quality.
- Recognition
  - Tuning Parameters of Preprocessing, Analysis, Recognition, and Synthesis
    Customization of document processing using objects of analysis, recognition, and synthesis parameters.
  - PageProcessingParams Object
    This object enables customization of analysis and recognition parameters. Using this object, you can indicate which image and text characteristics must be detected (inverted image, orientation, barcodes, recognition language, recognition error margin).
  - SynthesisParamsForPage Object
    This object includes parameters responsible for restoration of a page formatting during synthesis.
  - SynthesisParamsForDocument Object
    This object enables customization of the document synthesis: restoration of its structure and formatting.
  - MultiProcessingParams Object
    Simultaneous processing may be useful when processing a large number of images. In this case, the processing load will be spread over the processor cores during image opening and preprocessing, layout analysis, recognition, and export, which makes it possible to speed up processing.
    Reading modes (simultaneous or consecutive) are set using the MultiProcessingMode property. The RecognitionProcessesCount property controls the number of processes that may be started.
03.07.2024 8:50:25