Parallel Processing with ABBYY FineReader Engine

If you process large numbers of documents, OCR can be very demanding in terms of processing power. It often makes sense to use multi-CPU systems to increase the processing speed, and ABBYY FineReader Engine provides several ways in which you can utilize the multiprocessing capabilities of your hardware configurations.

This section outlines the possible usage scenarios and gives recommendations on choosing the particular multiprocessing mode for your task, together with testing statistics. There are also links to the code samples which provide multiprocessing implementations.

Basically, there are three ways multiprocessing is intended to be used with ABBYY FineReader Engine:

On the one hand, you can use a single Engine object and set its MultiProcessingParams property to appropriate values.

ABBYY FineReader Engine supports two different objects which provide multiprocessing from a single Engine instance: the FRDocument object (see Processing with FRDocument object) and the BatchProcessor object (see Processing using Batch Processor).

On the other hand, you can load several instances of Engine as out-of-process servers by means of COM (using the OutprocLoader object) and use each instance in its own process. See Processing using a pool of Engines.

Important! Parallel processing requires more RAM than sequential processing. The general recommendation for a workstation is 350 MB * (number of cores) + 450 MB of RAM. If you are processing documents in Arabic or CJK languages, the recommendation is 850 MB * (number of cores) + 750 MB of RAM.
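
The recommendation above is simple arithmetic, and it can help to compute it for the machines you plan to deploy on. The following Python sketch is only an illustration of the formula; the function name and parameters are hypothetical and are not part of ABBYY FineReader Engine.

```python
def recommended_ram_mb(cores: int, arabic_or_cjk: bool = False) -> int:
    """Return the recommended workstation RAM in MB for a given number of cores.

    Uses the figures quoted in the text:
      general case:  350 MB * cores + 450 MB
      Arabic or CJK: 850 MB * cores + 750 MB
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    per_core_mb, base_mb = (850, 750) if arabic_or_cjk else (350, 450)
    return per_core_mb * cores + base_mb


if __name__ == "__main__":
    for cores in (2, 4, 8):
        general = recommended_ram_mb(cores)
        cjk = recommended_ram_mb(cores, arabic_or_cjk=True)
        print(f"{cores} cores: {general} MB (general), {cjk} MB (Arabic/CJK)")
```

For a 4-core workstation this gives 1850 MB in the general case and 4150 MB for Arabic or CJK documents.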

Usage scenarios

We will take it for granted that you are processing a lot of documents. But we must also take into consideration the results you need to receive and choose the best means to implement your task. The distinct scenarios to consider are:

  • Converting multi-page documents with a large number of pages. It generally means processing books, long reports, etc. In this case, you can recognize pages of the document in parallel, then perform synthesis in the main process and export in parallel again. You can also, when using a pool of Engines, process several multi-page documents simultaneously, but the memory consumption can be huge and even lead to "out of memory" errors.
  • Converting a large number of one-page documents. This is the case when you process invoices, contracts, letters, etc. Parallel processing is easiest for this situation, as one-page documents do not depend on each other and do not require large amounts of memory at once.
  • Processing a large number of images and searching them for specific information, or working with the recognition results in some other way. You might not need to convert most of the images into an editable format, so the speed of synthesis and export is not an issue. The operation performed in multiple processes is iterating through layout blocks and accessing the recognition results for text blocks.

Note: If you want to use parallel processing for export, keep in mind that this feature is supported only for export to PDF (except TextOnly mode) and PPTX formats.

Recommendations and restrictions

  • For parallel processing of multi-page documents, we recommend using FRDocument. It is the easiest multiprocessing approach to code because you do not have to implement any additional interfaces.
    Opening, preprocessing, analysis, and recognition are performed in parallel. Document synthesis is performed sequentially in the main process. Export to the PDF (except TextOnly mode) and PPTX formats is then performed in parallel.
  • To process many one-page documents which are received from some source (such as a scanner), we recommend BatchProcessor.
    The advantage of this method is that it can be used when you do not know the number of documents in advance, when the documents can be of different types, and when they must be processed as soon as they arrive. The disadvantage is that it requires more implementation effort: you have to implement interfaces for a file adapter and a custom source of images.
    All processing stages are performed in parallel because in the case of one-page documents the page and document synthesis are performed for each page separately.

Note: Parallel export is not supported in the scenarios with Batch Processor.

  • To perform full processing of many one-page documents in parallel, you can use a pool of Engines loaded out-of-process by means of COM. This method is the most efficient in speed and automatically eliminates all difficulties related to multi-threading: all operations with the ABBYY FineReader Engine objects are serialized by means of COM. But it has some limitations:
    • due to the use of COM, you need to register FREngine.dll;
    • if your code is written in C++, working with COM requires more routine coding than, for example, in C#;
    • in this case, the processing runs in another process, so you cannot open images from memory, and iterating through the recognition results takes more time because each request has to be passed to another process and back;
    • and finally, loading several instances of Engine means more memory consumption, especially because in this case all processing stages are performed in parallel, and several synthesis operations can run at the same time, using up still more memory.

Note: Events that happened during parallel processing of a page are converted to events of a whole document.

Speed testing results

The table below presents the results of performance testing.

Processing mode                                                         | One-page documents | One multi-page document | Searching through results without exporting
Sequential processing                                                   | 60                 | 51                      | 87
Processing with FRDocument                                              | 41                 | 117                     | 57
Processing with FRDocument (with PageFlushingPolicy = PFP_KeepInMemory) | 55                 | 141                     | 82
Processing using Batch Processor                                        | 99                 | 115                     | 294
Processing using a pool of Engines                                      | 165                | 10                      | 102

The test machine had an Intel® Core™ i5-4440 processor (3.10 GHz, 4 physical cores) and 8 GB of RAM, and 4 processes were run simultaneously. Performance was tested on 300 English-language images, with the settings of the DocumentArchiving_Speed predefined profile. The numbers in the table are pages processed per minute. In the scenarios "one-page documents" and "one multi-page document" the documents are exported to PDF format.
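
To compare the modes more easily, the pages-per-minute figures can be expressed relative to sequential processing. The Python sketch below copies the numbers from the table above unchanged and computes the ratios; the helper names are hypothetical and only illustrate the arithmetic.

```python
# Pages per minute, copied from the table above.
# Column order: one-page documents, one multi-page document,
# searching through results without exporting.
RESULTS = {
    "Sequential processing": (60, 51, 87),
    "Processing with FRDocument": (41, 117, 57),
    "Processing with FRDocument (PFP_KeepInMemory)": (55, 141, 82),
    "Processing using Batch Processor": (99, 115, 294),
    "Processing using a pool of Engines": (165, 10, 102),
}


def speedups(results, baseline="Sequential processing"):
    """Return each mode's throughput divided by the baseline, rounded to 2 places."""
    base = results[baseline]
    return {
        mode: tuple(round(value / b, 2) for value, b in zip(values, base))
        for mode, values in results.items()
    }


def fastest_mode(results, scenario_index):
    """Return the mode with the highest pages-per-minute value in one column."""
    return max(results, key=lambda mode: results[mode][scenario_index])


if __name__ == "__main__":
    for mode, ratios in speedups(RESULTS).items():
        print(f"{mode}: {ratios}")
```

With these figures, the pool of Engines is fastest for one-page documents, FRDocument with PFP_KeepInMemory for one multi-page document, and Batch Processor for searching through results without exporting.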

Processing with FRDocument object

The number of processes to run is detected automatically depending on the number of available physical or logical CPU cores, number of free CPU cores available in the license, and number of pages in the document. To turn on the multiprocessing mode, do the following:

  1. Set the value of the MultiProcessingMode property of the MultiProcessingParams subobject of the Engine object. Parallel processing is used if this property is set to MPM_Parallel or MPM_Auto, and the number of pages in the document and the number of available CPU cores are both greater than one.
  2. Tune the number of processes to be run using the RecognitionProcessesCount property and specify the values of other properties, if necessary.
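
The condition in step 1 can be written as a small predicate. This Python sketch is a model of the rule as stated above, not a call into the Engine: the enum only mirrors the MultiProcessingMode value names mentioned in this article, and the function name is hypothetical.

```python
from enum import Enum


class Mode(Enum):
    # Names mirror the MultiProcessingMode values named in the text.
    MPM_AUTO = "MPM_Auto"
    MPM_PARALLEL = "MPM_Parallel"
    MPM_SEQUENTIAL = "MPM_Sequential"


def uses_parallel_processing(mode: Mode, page_count: int, available_cores: int) -> bool:
    """Model of the documented rule for FRDocument.

    Parallel processing is used only if the mode is MPM_Parallel or MPM_Auto
    and both the page count and the number of available cores exceed one.
    """
    return (
        mode in (Mode.MPM_AUTO, Mode.MPM_PARALLEL)
        and page_count > 1
        and available_cores > 1
    )


if __name__ == "__main__":
    print(uses_parallel_processing(Mode.MPM_PARALLEL, page_count=10, available_cores=4))
    print(uses_parallel_processing(Mode.MPM_PARALLEL, page_count=1, available_cores=4))
```

A practical consequence is that a one-page document opened through FRDocument is processed sequentially even when the mode is MPM_Parallel.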

After setting up multiprocessing settings, you can use the standard procedure of working with FRDocument. ABBYY FineReader Engine will automatically start several recognition processes when you call one of the following methods of the FRDocument object:

For each page of the document, a new processing task is created, and this task is passed to one of the recognition processes. When a recognition process completes the task, it receives the next processing task. This is done until all the tasks are processed.

The ABBYY FineReader Engine distribution package includes the MultiProcessingRecognition demo tool, which demonstrates the gain in speed when using multiprocessing recognition with the FRDocument object and contains an implementation that you can use as a starting point for your own application.

Processing using Batch Processor

When Batch Processor is initialized, asynchronous recognition processes are invoked and configured. Then the processor takes image files from a custom image source. For each page of the image file, a new processing task is created, and this task is passed to one of the recognition processes. If all the tasks for one file have been passed for processing, but not all of the recognition processes are occupied, the next image file from the image queue of the source is taken and passed for processing. This is done until the first image page has been converted and passed to the user. Pages are returned to the user in the order they have been taken from the image source.
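
The ordering guarantee described above (pages are processed in parallel but returned in source order) can be illustrated with a generic Python model. This is only a sketch of the concept using the standard library; it does not use the BatchProcessor API, and the page names and delays are made up for the demonstration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

completion_order = []          # order in which workers actually finish
completion_lock = threading.Lock()


def recognize(page):
    """Stand-in for recognizing one page; the delay simulates processing time."""
    name, delay_seconds = page
    time.sleep(delay_seconds)
    with completion_lock:
        completion_order.append(name)
    return name


def process_in_source_order(pages, workers=4):
    """Process pages concurrently and return results in the order they were supplied."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, whatever the completion order.
        return list(pool.map(recognize, pages))


if __name__ == "__main__":
    pages = [("page-1", 0.05), ("page-2", 0.01), ("page-3", 0.03)]
    print("returned: ", process_in_source_order(pages))
    print("completed:", completion_order)
```

In this model a slow first page delays the delivery of later pages that are already finished, which is the trade-off of returning results in source order.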

To organize multiprocessing with the Batch Processor, do the following:

  1. Implement the IImageSource and IFileAdapter interfaces, which provide access to the image source and files in it.
  2. [optional] Implement the IAsyncProcessingCallback interface to manage the processing. The methods of this interface allow you to handle errors and/or cancel the processing.
  3. [optional] Set up multiprocessing using the MultiProcessingParams subobject of the Engine object. Please note that there is no need to set the MultiProcessingMode property, because parallel processing is used by default if you work with Batch Processor. Tune the number of processes to be run using the RecognitionProcessesCount property and specify the values of other properties, if necessary.
  4. Call the CreateBatchProcessor method of the Engine object, to receive the BatchProcessor object.
  5. Call the Start method of this object to initialize the processor and invoke asynchronous recognition processes. You can specify the source of images, pass the references to the IAsyncProcessingCallback interface and parameters objects in the call to this method.
  6. Call the GetNextProcessedPage method in a loop until the method returns 0, which means that there are no more images in the source and all the processed images have been returned to the user.

Important! The page returned by the GetNextProcessedPage method exists until the next call of this method. Therefore, if you want to save this page, you must save it using the methods of the FRPage object or add it to an existing document using the IFRDocument::AddPage method BEFORE the next call of the GetNextProcessedPage method.

The ABBYY FineReader Engine distribution package includes the BatchProcessing sample, which shows how to use Batch Processor, and the BatchProcessingRecognition demo tool, which shows the gain in speed when using multiprocessing recognition with the Batch Processor.

Processing using a pool of Engines

In this multiprocessing scenario, you use several instances of Engine loaded out-of-process. Inside each worker thread, the procedure can be almost the same as for single-threaded processing. However, we recommend that you implement a custom image source that distributes the images among the threads, using a synchronization object to ensure that each image is processed once and only once.
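
The "once and only once" requirement can be met with any thread-safe queue. The Python sketch below shows the pattern with the standard library only; it is a generic illustration and not the EnginesPool sample. The function name is hypothetical, and the place where each worker would use its own out-of-process Engine instance is marked with a comment rather than implemented.

```python
import queue
import threading


def run_pool(image_paths, worker_count=4):
    """Distribute images among worker threads so that each image is taken exactly once."""
    images = queue.Queue()
    for path in image_paths:
        images.put(path)

    processed = []                      # (worker_id, path) pairs
    processed_lock = threading.Lock()

    def worker(worker_id):
        # Assumption: in a real application each worker would own one
        # out-of-process Engine instance and process the image with it here.
        while True:
            try:
                path = images.get_nowait()   # atomic: no two workers get the same item
            except queue.Empty:
                return                       # no images left, the worker exits
            with processed_lock:
                processed.append((worker_id, path))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


if __name__ == "__main__":
    result = run_pool([f"image_{i}.tif" for i in range(10)])
    print(len(result), "images processed")
```

The queue is the only shared state the workers compete for, so no further locking is needed to decide which thread handles which image.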

To load the Engine object out-of-process, use the OutprocLoader object, which implements the IEngineLoader interface. If you use it under special accounts, those accounts may need permissions to run OutprocLoader.

Additionally, you can manage the priority of a host process and control whether it is alive using the IHostProcessControl interface.

Notes:

  • Account permissions can be set up using the DCOM Config utility (either type DCOMCNFG in the command line, or select Control Panel > Administrative Tools > Component Services). In the console tree, locate the Component Services > Computers > My Computer > DCOM Config folder, right-click ABBYY FineReader Engine 12.2 Loader (Local Server), and click Properties. A dialog box will open. Click the Security tab. Under Launch Permissions, click Customize, and then click Edit to specify the accounts that can launch the application.

    Note that on a 64-bit operating system the registered DCOM-application is available in the 32-bit MMC console, which can be run using the following command line:

    "mmc comexp.msc /32"
    
  • To register FREngine.dll when installing your application on an end-user computer, use the following command line:

    regsvr32 /s /n /i:"<path to the Inc folder>" "<path to FREngine.dll>"
    
  • When you use Engine as an out-of-process server, specify the sequential mode of document processing by setting the MultiProcessingMode property of the MultiProcessingParams object to MPM_Sequential.

The ABBYY FineReader Engine distribution package includes the EnginesPool sample, which shows the gain in speed when using a pool of Engines and provides a ready-to-use implementation that can be the starting point for your own application. Consult the source code of this sample for details on implementing a custom image source, handling exceptions, and controlling CPU core usage.

See also

FRDocument

BatchProcessor

MultiProcessingParams

Iterating Document Pages

Exporting Large Documents

Different Ways to Load Engine Object

Using ABBYY FineReader Engine in Multi-Threaded Server Applications

17.09.2024 15:14:40
