Parallel Processing with ABBYY FineReader Engine
If you process large numbers of documents, OCR can be very demanding in terms of processing power. It often makes sense to use multi-CPU systems to increase the processing speed, and ABBYY FineReader Engine provides several ways in which you can utilize the multiprocessing capabilities of your hardware configurations.
This section outlines the possible usage scenarios and gives recommendations on choosing the particular multiprocessing mode for your task, together with testing statistics. There are also links to the code samples which provide multiprocessing implementations.
To use multiprocessing, you need to set the MultiProcessingParams property of the Engine object to proper values.
ABBYY FineReader Engine supports two different objects which provide multiprocessing from a single Engine instance: the FRDocument object (see Processing with FRDocument object) and the BatchProcessor object (see Processing using Batch Processor).
Important! Please take into account that parallel processing requires more RAM than sequential processing. The general recommendation for a workstation is 350 MB * (number of cores) + 450 MB of RAM, and if you are processing documents in Arabic or CJK languages, 850 MB * (number of cores) + 750 MB of RAM.
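The recommendation above is plain arithmetic, so it can be checked with a short helper. The following is a minimal sketch in Python; the function name and its parameters are illustrative and are not part of the FineReader Engine API.

```python
def recommended_ram_mb(cores: int, arabic_or_cjk: bool = False) -> int:
    """Return the recommended workstation RAM in MB for parallel processing.

    Uses the figures quoted above: 350 MB per core plus 450 MB, or
    850 MB per core plus 750 MB for Arabic or CJK languages.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    per_core, base = (850, 750) if arabic_or_cjk else (350, 450)
    return per_core * cores + base


if __name__ == "__main__":
    print(recommended_ram_mb(8))                      # 8 cores, other languages
    print(recommended_ram_mb(8, arabic_or_cjk=True))  # 8 cores, Arabic or CJK
```

For an 8-core workstation this gives 3250 MB, or 7550 MB when Arabic or CJK languages are processed.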
Usage scenarios
We will take it for granted that you are processing a lot of documents. But we must also take into consideration the results you need to receive and choose the best means to implement your task. The distinct scenarios to consider are:
- Converting multi-page documents with a large number of pages. It generally means processing books, long reports, etc. In this case, you can recognize pages of the document in parallel, then perform synthesis in the main process and export in parallel again. You can also, when using a pool of Engines, process several multi-page documents simultaneously, but the memory consumption can be huge and even lead to "out of memory" errors.
- Converting a large number of one-page documents. This is the case when you process invoices, contracts, letters, etc. Parallel processing is easiest for this situation, as one-page documents do not depend on each other and do not require large amounts of memory at once.
- Processing a large number of images and searching them for the necessary information, or working with the recognition results in some other way. You might not need to convert most of them into an editable format, so the speed of synthesis and export is not an issue. The operation performed in multiple processes is iterating through layout blocks and accessing the recognition results for text blocks.
Note: If you want to use parallel processing for export, keep in mind that this feature is supported only for export to PDF (except TextOnly mode) and PPTX formats.
Recommendations and restrictions
- For parallel processing of multi-page documents, we recommend using FRDocument. It is the easiest multiprocessing approach to code because you do not have to implement any additional interfaces.
Opening, preprocessing, analysis, and recognition are performed in parallel; document synthesis is performed sequentially in the main process, and then export to PDF (except TextOnly mode) and PPTX formats is performed in parallel.
- To process many one-page documents which are received from some source (such as a scanner), we recommend BatchProcessor.
The advantage of this method is that it can be used when you do not know the number of documents in advance, when the documents can be of different types, and when they must be processed as soon as they arrive. The disadvantage is that it requires more implementation effort: you have to implement interfaces for a file adapter and a custom source of images.
All processing stages are performed in parallel because in the case of one-page documents the page and document synthesis are performed for each page separately.
Note: Parallel export is not supported in the scenarios with Batch Processor.
- To catch and handle the events that occur during parallel processing, you can use the IParallelProcessingCallback interface. This interface can be very useful for managing problematic situations. For example, when a timeout error occurs, the IParallelProcessingCallback interface offers several ways to resolve the problem, depending on the user preferences. For more information, see IParallelProcessingCallback::OnWaitIntervalExceeded.
Note: Events that occur during parallel processing of a page are converted to events of the whole document.
Processing with FRDocument object
The number of processes to run is detected automatically depending on the number of available physical or logical CPU cores, number of free CPU cores available in the license, and number of pages in the document. To turn on the multiprocessing mode, do the following:
- Set the value of the MultiProcessingMode property of the MultiProcessingParams subobject of the Engine object. Parallel processing is used if this property is set to MPM_Parallel or MPM_Auto, and the number of pages in the document and the number of available CPU cores are both greater than one.
- Tune the number of processes to be run using the RecognitionProcessesCount property and specify the values of other properties, if necessary.
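The condition under which parallel processing is actually used can be written as a small predicate. This is a sketch of the rule as stated above, not a call into the Engine: the Python enum and the function are illustrative stand-ins, and MPM_Sequential is an assumed name for a non-parallel mode, since only MPM_Parallel and MPM_Auto are named here.

```python
from enum import Enum


class MultiProcessingMode(Enum):
    """Illustrative mirror of the mode values; not the Engine's own type."""
    MPM_AUTO = "MPM_Auto"
    MPM_PARALLEL = "MPM_Parallel"
    MPM_SEQUENTIAL = "MPM_Sequential"  # assumed name for a non-parallel mode


def uses_parallel_processing(mode: MultiProcessingMode,
                             page_count: int,
                             available_cores: int) -> bool:
    """Apply the stated rule: a parallel-capable mode, more than one page,
    and more than one available CPU core."""
    parallel_modes = {MultiProcessingMode.MPM_AUTO, MultiProcessingMode.MPM_PARALLEL}
    return mode in parallel_modes and page_count > 1 and available_cores > 1


if __name__ == "__main__":
    # A one-page document is processed sequentially even in MPM_Parallel mode.
    print(uses_parallel_processing(MultiProcessingMode.MPM_PARALLEL, 1, 8))
    print(uses_parallel_processing(MultiProcessingMode.MPM_AUTO, 120, 8))
```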
After setting up multiprocessing settings, you can use the standard procedure of working with FRDocument. ABBYY FineReader Engine will automatically start several recognition processes when you call one of the following methods of the FRDocument object:
- AddImageFile, AddImageFileFromMemory, AddImageFileFromStream, AddImageFileWithPassword, AddImageFileWithPasswordCallback
- Preprocess, PreprocessPages
- Analyze, AnalyzePages
- Recognize, RecognizePages
- Process, ProcessPages
- Export, ExportPages, ExportToMemory — for export to PDF (except TextOnly mode) and PPTX formats only
For each page of the document, a new processing task is created, and this task is passed to one of the recognition processes. When a recognition process completes the task, it receives the next processing task. This is done until all the tasks are processed.
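The task distribution described above follows a familiar work-queue pattern. The sketch below models it in plain Python under stated assumptions: threads stand in for the separate recognition processes, recognize_page is a placeholder for the real per-page work, and none of the names belong to the FineReader Engine API.

```python
import queue
import threading


def recognize_page(page_index):
    """Placeholder for per-page work; a real process would run OCR here."""
    return f"text of page {page_index}"


def process_document(page_count, process_count):
    """Create one task per page and let workers pull tasks until none remain."""
    tasks = queue.Queue()
    for page_index in range(page_count):
        tasks.put(page_index)

    results = [None] * page_count

    def worker():
        while True:
            try:
                page_index = tasks.get_nowait()  # take the next task, if any
            except queue.Empty:
                return  # all tasks have been handed out
            results[page_index] = recognize_page(page_index)

    workers = [threading.Thread(target=worker) for _ in range(process_count)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    print(process_document(page_count=5, process_count=3))
```

Because each worker takes a new task as soon as it finishes the previous one, a slow page does not hold up the workers that are free.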
Processing using Batch Processor
When Batch Processor is initialized, asynchronous recognition processes are invoked and configured. Then the processor takes image files from a custom image source. For each page of the image file, a new processing task is created, and this task is passed to one of the recognition processes. If all the tasks for one file have been passed for processing, but not all of the recognition processes are occupied, the next image file from the image queue of the source is taken and passed for processing. This is done until the first image page has been converted and passed to the user. Pages are returned to the user in the order they have been taken from the image source.
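The ordering guarantee, pages processed concurrently but returned in source order, can be illustrated with a simplified model. This sketch is an assumption-laden simplification: it uses a Python thread pool in place of recognition processes, submits all pages up front rather than pulling files lazily from an image source, and uses invented names (convert_page, processed_pages) that are not part of the Engine API.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_page(file_name, page_index):
    """Placeholder for the conversion of a single page."""
    return f"{file_name}#{page_index}"


def processed_pages(files, process_count):
    """Yield converted pages in the order they were taken from the source.

    `files` is a list of (file_name, page_count) pairs in source order.
    """
    with ThreadPoolExecutor(max_workers=process_count) as pool:
        futures = [
            pool.submit(convert_page, name, page_index)
            for name, page_count in files
            for page_index in range(page_count)
        ]
        # Iterating in submission order preserves source order even when
        # a later page finishes before an earlier one.
        for future in futures:
            yield future.result()


if __name__ == "__main__":
    for page in processed_pages([("a.tif", 2), ("b.tif", 1)], process_count=2):
        print(page)
```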
To organize multiprocessing with the Batch Processor, do the following:
- Implement the IImageSource and IFileAdapter interfaces, which provide access to the image source and files in it.
- [optional] Implement the IAsyncProcessingCallback interface to manage the processing. The methods of this interface allow you to handle errors and/or cancel the processing.
- [optional] Set up multiprocessing using the MultiProcessingParams subobject of the Engine object. Please note that there is no need to set the MultiProcessingMode property, because parallel processing is used by default if you work with Batch Processor. Tune the number of processes to be run using the RecognitionProcessesCount property and specify the values of other properties, if necessary.
- Call the CreateBatchProcessor method of the Engine object to receive the BatchProcessor object.
- Call the Start method of this object to initialize the processor and invoke asynchronous recognition processes. You can specify the source of images, pass the references to the IAsyncProcessingCallback interface and parameters objects in the call to this method.
- Call the GetNextProcessedPage method in a loop until the method returns 0, which means that there are no more images in the source and all the processed images have been returned to the user.
Important! The page returned by the GetNextProcessedPage method exists until the next call of this method. Therefore, if you want to save this page, you must save it using the methods of the FRPage object or add it to an existing document using the IFRDocument::AddPage method BEFORE the next call of the GetNextProcessedPage method.
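The retrieval loop and the page lifetime rule can be modelled without the Engine. In the sketch below, StubBatchProcessor and StubPage are hypothetical stand-ins written for illustration only; they imitate the behaviour described above (a page is invalidated by the next call, and an empty result signals the end) and do not reproduce the real BatchProcessor interface.

```python
class StubPage:
    """Hypothetical stand-in for a processed page."""

    def __init__(self, text):
        self.text = text
        self.valid = True


class StubBatchProcessor:
    """Hypothetical stand-in that imitates the page lifetime rule."""

    def __init__(self, texts):
        self._texts = list(texts)
        self._current = None

    def get_next_processed_page(self):
        if self._current is not None:
            self._current.valid = False  # the previous page ceases to exist
        if not self._texts:
            self._current = None
            return None  # no more images in the source
        self._current = StubPage(self._texts.pop(0))
        return self._current


def collect_texts(processor):
    """Loop until the processor is exhausted, saving each page's data
    before asking for the next page."""
    saved = []
    while True:
        page = processor.get_next_processed_page()
        if page is None:
            break
        saved.append(page.text)  # save BEFORE the next call
    return saved


if __name__ == "__main__":
    print(collect_texts(StubBatchProcessor(["invoice 1", "invoice 2"])))
```

Keeping a reference to the page object itself is not enough in this model: only the data copied out before the next call remains usable.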
The ABBYY FineReader Engine distribution package includes the BatchProcessing sample, which shows how to use Batch Processor.
03.07.2024 8:50:25