Tuning Parameters of Page Preprocessing, Analysis, Recognition, and Synthesis
Document processing in ABBYY FineReader Engine consists of several steps: page preprocessing, analysis, recognition, page synthesis, document synthesis, and export. This section deals with page preprocessing, analysis, recognition, and page/document synthesis. For details about export parameters, see Tuning Export Parameters.
Let's consider the processing stages in order:
- Page preprocessing
During this stage, FineReader Engine automatically improves image quality and corrects defects that can interfere with OCR: page orientation, inverted images, and geometrical distortions.
- Layout analysis
During analysis, FineReader Engine finds areas that contain different types of data. These areas are called "blocks."
- Recognition
Parts of the image that lie inside the blocks are recognized in ways that depend on the block type.
- Page synthesis
The text and background colors, hyperlinks, and other formatting are detected.
- Document synthesis
Finally, the font styles and the logical structure of the document are recreated: FineReader Engine detects headings in the recognized document, reconstructs the table of contents, and detects captions for pictures and tables as well as other elements of document structure.
Before processing, you can set the parameters of page preprocessing, analysis, recognition, and synthesis through parameter objects. The main object that provides access to all processing parameters is the DocumentProcessingParams object. It has a set of subobjects that influence different processing stages.
Depending on the processing stage, either the pointer to a DocumentProcessingParams object or pointers to its subobjects can be passed to the processing methods as input parameters, and thus affect the results of processing. The FRDocument and FRPage objects provide page preprocessing, analysis, recognition, and synthesis methods.
The processes of page preprocessing, analysis, recognition, and synthesis can also be tuned using profiles, and most popular scenarios are already covered by well-tested predefined profiles. See Working with Profiles for details.
Page processing
To set the processing parameters for each page, use the properties of the PageProcessingParams subobject of the DocumentProcessingParams object. The PageProcessingParams object is the parent of a group of objects that set up the page processing parameters:
- PagePreprocessingParams
- ColorObjectsProhibitingParams
- PageAnalysisParams
- ObjectsExtractionParams
- RecognizerParams
- SynthesisParamsForPage
The PageProcessingParams object also allows you to turn on or off any of the processing stages. For example, you can set the PerformAnalysis property of the PageProcessingParams object to FALSE if you intend to specify the blocks manually and don't need layout analysis.
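The case described above can be sketched in C# as follows. This is a minimal illustration, not a definitive implementation: it assumes the FineReader Engine interop assembly is referenced and the engine is already loaded, the helper name CreateParamsWithoutAnalysis is hypothetical, and the FREngine.I* interface names are assumed to follow the IFRDocument naming used in this section.

```csharp
static class ManualBlocksSketch
{
    // Hypothetical helper: returns processing parameters with layout analysis turned off.
    // Assumes 'engine' is an already initialized FineReader Engine instance.
    public static FREngine.IDocumentProcessingParams CreateParamsWithoutAnalysis(
        FREngine.IEngine engine)
    {
        // All properties start with their default values.
        FREngine.IDocumentProcessingParams dpp = engine.CreateDocumentProcessingParams();

        // The blocks will be specified manually, so automatic layout analysis is skipped.
        dpp.PageProcessingParams.PerformAnalysis = false;

        return dpp;
    }
}
```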
Document processing
To set the parameters of document processing, you need to set the parameters of document synthesis via the SynthesisParamsForDocument object in addition to the page processing parameters. Font styles and formatting are detected during document synthesis. FineReader Engine objects that deal with document fonts and styles become meaningful only after document synthesis.
You may omit the stage of document synthesis in the following cases:
- If you are going to export recognized text to TXT format. When exporting to this format, synthesis information is not used.
- If you are going to export a document to PDF ImageOnly format. The recognized text and layout information are not used in this mode.
In all other cases, document synthesis must be performed. Omitting document synthesis will cause errors during export.
Note: Methods having the word "Process" in their names (for example, IFRDocument::Process) include the stage of document synthesis. Processing methods of the FRPage object do not include it, so after using them and before export, you must explicitly call some method that performs document synthesis.
You can speed up document synthesis and decrease memory usage. If you set the DetectFontFormattingAtPageLevel property of the SynthesisParamsForPage object to TRUE during page synthesis, you can then turn off the detection of font parameters and document structure during document synthesis (the DetectFontFormatting and DetectDocumentStructure properties of the SynthesisParamsForDocument object). However, the quality of the result may deteriorate.
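A rough C# sketch of this trade-off is shown below. It assumes the FineReader Engine interop assembly is referenced and that the subobjects are reachable through the property paths implied by this section (SynthesisParamsForPage under PageProcessingParams, SynthesisParamsForDocument under DocumentProcessingParams). The helper name CreateFastSynthesisParams is hypothetical.

```csharp
static class FastSynthesisSketch
{
    // Hypothetical helper: trades some output quality for faster document synthesis.
    // Assumes 'engine' is an already initialized FineReader Engine instance.
    public static FREngine.IDocumentProcessingParams CreateFastSynthesisParams(
        FREngine.IEngine engine)
    {
        FREngine.IDocumentProcessingParams dpp = engine.CreateDocumentProcessingParams();

        // Detect font formatting while each page is synthesized...
        dpp.PageProcessingParams.SynthesisParamsForPage.DetectFontFormattingAtPageLevel = true;

        // ...so that the document-level detection steps can be turned off.
        dpp.SynthesisParamsForDocument.DetectFontFormatting = false;
        dpp.SynthesisParamsForDocument.DetectDocumentStructure = false;

        return dpp;
    }
}
```

Because quality may deteriorate with these settings, it is worth comparing the exported result against one produced with the default parameters before adopting them.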
Every time the text or layout of the document is changed (for example, a block is removed or added, or text is edited), we recommend that you call a document synthesis method again. Note that the SynthesizePages method of the FRDocument object allows you to specify the collection of pages that were changed, so that only the necessary data is recalculated.
Tuning document processing
A step-by-step procedure that uses the parameter objects mentioned above should look like this:
- Create a DocumentProcessingParams object with the help of the CreateDocumentProcessingParams method of the Engine object.
- Set up the necessary properties of the PageProcessingParams subobject. You do not need to set up all the properties of all the subobjects, because on creation they are initialized with reasonable defaults. You only have to change the properties for which you want values other than the defaults.
When you are setting up the parameters to be used by the layout analysis functions, do not forget to set the correct values of the properties of the subobjects of the PageProcessingParams that affect recognition. This is recommended, because all these parameters are copied into the blocks that are created during the layout analysis and are then used for recognition, and also because analysis of certain parts of the image may involve recognition.
- If required, set up the properties of the SynthesisParamsForDocument subobject. As with the page processing parameters, you only have to change the properties for which you want values other than the defaults. Check that the value of the PerformSynthesis property of the DocumentProcessingParams object is TRUE.
- Pass the DocumentProcessingParams object or a set of its subobjects to one of the processing methods of the FRDocument, FRPage, or Engine objects.
To recognize a document, we suggest using the processing methods of the FRDocument object, which provides a whole array of them. The most convenient is the Process method, which performs preprocessing, analysis, recognition, and synthesis in a single call. It also uses the simultaneous processing features of multiprocessor and multicore systems in the most efficient manner. However, you can also carry out preprocessing, analysis, recognition, and synthesis consecutively using the corresponding methods.
C# code
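The sample below is a minimal sketch of the step-by-step procedure, not the original sample for this section. It assumes the FineReader Engine interop assembly is referenced, the engine is loaded, and the document already contains the images to be processed. The helper name ProcessWithCustomParams is hypothetical, and the SetPredefinedTextLanguage call on RecognizerParams is an assumption used only to illustrate changing one recognition property.

```csharp
static class ProcessingSketch
{
    // Hypothetical helper that follows the procedure described in this section.
    // Assumes 'engine' is initialized and 'document' already contains images.
    public static void ProcessWithCustomParams(
        FREngine.IEngine engine,
        FREngine.IFRDocument document)
    {
        // Step 1: create the parameters object; all properties start with defaults.
        FREngine.IDocumentProcessingParams dpp = engine.CreateDocumentProcessingParams();

        // Step 2: change only the page processing properties that should differ
        // from the defaults. Recognition settings are set here as well, because
        // they are copied into the blocks created during layout analysis.
        // Assumption: RecognizerParams exposes SetPredefinedTextLanguage.
        dpp.PageProcessingParams.RecognizerParams.SetPredefinedTextLanguage("English");

        // Step 3: make sure document synthesis is performed.
        dpp.PerformSynthesis = true;

        // Step 4: run preprocessing, analysis, recognition, and synthesis in one call.
        document.Process(dpp);
    }
}
```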
Similar procedures are used in the following code samples: CustomLanguage, VisualComponents; and demo tools: MultiProcessingRecognition, PDFExportProfiles.
07.11.2025 12:48:30