Factors affecting performance
When calculating system performance and adjusting it for a scaled-up installation, the following factors should be taken into account.
- Hardware performance
- Processing Station CPU speed
Faster CPUs make for faster OCR. The table below shows how processing performance depends on the CPU model and the number of cores used.
Table 1. Number of black-and-white pages (in thousands) processed per 24 hours using 1, 2, 4, and 8 CPU cores (Intel® Xeon™).
No. of CPU cores / Processor model | Intel Xeon E5-2680 v4 2.4 GHz | Intel Xeon E5-2660 V2 2.6 GHz | Intel Xeon E5-2697A v4 2.6 GHz | Intel Xeon E5520 2.27 GHz | Intel Xeon E5-2640 v4 2.4 GHz |
1 | 44 | 49 | 53 | 33 | 67 |
2 | 88 | 98 | 106 | 65 | 134 |
4 | 175 | 195 | 211 | 129 | 266 |
8 | 348 | 388 | 419 | n/a | 530 |
Only one Processing Station was used in the test installation, so the effect of network, file storage, database, and ABBYY FineReader Server loads was negligible. Adding more Processing Stations will result in a non-linear increase in performance.
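As a rough illustration, the Table 1 figures can be compared against ideal linear scaling from the single-core result. The sketch below is not part of the product; the dictionary layout and the function name `scaling_efficiency` are assumptions made for this example, and the numbers are copied from Table 1.

```python
# Throughput from Table 1, in thousands of pages per 24 hours, keyed by core count.
TABLE_1 = {
    "Xeon E5-2680 v4 2.4 GHz": {1: 44, 2: 88, 4: 175, 8: 348},
    "Xeon E5-2660 v2 2.6 GHz": {1: 49, 2: 98, 4: 195, 8: 388},
    "Xeon E5-2697A v4 2.6 GHz": {1: 53, 2: 106, 4: 211, 8: 419},
    "Xeon E5520 2.27 GHz": {1: 33, 2: 65, 4: 129},  # no 8-core figure in Table 1
    "Xeon E5-2640 v4 2.4 GHz": {1: 67, 2: 134, 4: 266, 8: 530},
}


def scaling_efficiency(throughput_by_cores):
    """Return measured throughput divided by ideal linear scaling, per core count."""
    single_core = throughput_by_cores[1]
    return {
        cores: round(pages / (single_core * cores), 3)
        for cores, pages in throughput_by_cores.items()
    }


for model, data in TABLE_1.items():
    print(model, scaling_efficiency(data))
```

On a single Processing Station the ratios stay close to 1.0 (for example, about 0.989 at 8 cores on the E5-2680 v4). These figures say nothing about multi-station installations, where the increase is non-linear.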
- Number of CPUs
The system can distribute jobs among multiple Processing Stations, engaging multiple CPU cores of the host computer or any other computer in the network. Please observe the following recommendations:
- If your calculations show that more than 50 CPU cores are required, be sure to test your installation for bottlenecks (see Monitoring the system and identifying bottlenecks).
- If your calculations show that between 50 and 100 CPU cores are required, increase this estimate by 20%.
- If your calculations show that more than 100 CPU cores are required, be sure to load-test the installation.
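The recommendations above can be sketched as a small sizing helper. This is only an illustration: the function `plan_cores`, its parameters, and the inclusive reading of "between 50 and 100" are assumptions, and the per-core throughput must come from your own measurements or from a table such as Table 1.

```python
import math


def plan_cores(pages_per_day, pages_per_core_per_day):
    """Estimate the CPU cores needed and list the follow-up checks to run."""
    base = math.ceil(pages_per_day / pages_per_core_per_day)
    recommended = base
    checks = []
    if base > 50:
        checks.append("test for bottlenecks")
    if 50 <= base <= 100:
        # Assumption: "between 50 and 100" is read as inclusive at both ends.
        recommended = math.ceil(base * 6 / 5)  # increase the estimate by 20%
    if base > 100:
        checks.append("load-test the installation")
    return {"base_cores": base, "recommended_cores": recommended, "checks": checks}


# Example: 3 million pages per day at 44,000 pages per core per day
# (the single-core Xeon E5-2680 v4 figure from Table 1).
plan = plan_cores(3_000_000, 44_000)
print(plan)
```

In this example the raw estimate is 69 cores, which falls in the 50 to 100 range and is raised to 83.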
- Disk speed
- Large processing volumes require copying large amounts of data between input, output, and temp folders. Faster disk read/write speeds mean faster file copying.
- SSDs are recommended for computers hosting the Server Manager and for storing input and output files.
- If HDDs are used, consider using separate hard disks for import and export operations.
- Network speed
- If a large number of Processing Stations is connected to the Server Manager, low bandwidth may slow down the distribution of jobs among the stations.
- Using remote Processing Stations may result in a 10–20% loss in performance compared to installations where all of the components are installed on one computer.
- Type of input files
Files may come in different formats (e.g. image formats, PDF, or file formats used by common office applications). The table below shows how the format of input files affects performance.
Table 2. Number of black-and-white pages (in thousands) processed per 24 hours using 16 CPU cores (Intel® Core™ i5-2400 @ 3.10 GHz).
Export/import | Black-and-white TIFF (100-page documents) | Color TIFF (1-page documents) | Color PDF (5-page documents) |
PDF | 640 | 263 | 318 |
PDF/A | 608 | 256 | 315 |
- Number of pages in document
ABBYY FineReader Server can easily handle documents containing 25–250 pages. Documents containing more than 1,000 pages may significantly slow down the system.
- Image properties (quality, color, resolution)
- Image quality affects both the speed and quality of OCR—good-quality, legible texts will be recognized faster and more accurately.
- Color images take longer to process than black-and-white or grayscale images.
- Images with very high resolutions (i.e. over 600 dpi) will take many times longer to load than standard 300-dpi images.
- Page design (font types and sizes, placement of text and pictures, etc.)
Complicated layouts take longer to process and the quality of OCR may be lower.
- Image size
A non-linear decrease in performance is possible on images larger than A3.
- Processing parameters
- Workflow
- Maximum speeds will be attained when using Hot Folder or DocLibrary workflows, provided that the documents are stored locally.
- If a DocLibrary workflow is used, ABBYY FineReader Server will need to crawl the file store for files that require processing. Crawling is a single-thread process, and any delays introduced at this stage (e.g. when processing a SharePoint library or a remote file store) will cause the server to become idle while the program is looking for files.
- If an E-mail workflow is used, the single-thread POP3 and Exchange mail (MAPI) protocols will be slower than IMAP, which allows multiple threads. Note, however, that IMAP may restrict the number of server connections.
- To speed up E-mail workflows, avoid storing very large amounts of mail in the mailbox. The first time you run an E-mail workflow, all of the messages stored in the mailbox will be loaded. The next time you run this workflow, ABBYY FineReader Server will crawl the mailbox only for new messages.
- Splitting documents into fragments for concurrent processing
- By default, multi-page documents are split into fragments of 25 pages each. You can reduce the default size and split documents into fragments as small as one page. This may produce a 30–50% gain in processing speed per job, but will require adding more CPU cores to maintain the same load.
- When processing very large documents, you can increase the size of a fragment to include up to 100–500 pages or even the entire document. This will reduce network loads and will work well for "regular" A4 documents but may make the program more prone to errors on complicated documents, such as large technical drawings or cardiograms.
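To see how the fragment size changes the number of units that can be processed concurrently, consider the arithmetic sketched below. The helper names `fragment_count` and `fragment_ranges` are hypothetical, and the even page-range split is an assumption made for illustration; the source does not describe how ABBYY FineReader Server chooses fragment boundaries.

```python
import math


def fragment_count(pages, fragment_size=25):
    """Number of fragments a document yields for a given fragment size."""
    return math.ceil(pages / fragment_size)


def fragment_ranges(pages, fragment_size=25):
    """1-based, inclusive (first_page, last_page) ranges for each fragment."""
    return [
        (start, min(start + fragment_size - 1, pages))
        for start in range(1, pages + 1, fragment_size)
    ]


pages = 120
print(fragment_count(pages))        # default size of 25 pages
print(fragment_count(pages, 1))     # one page per fragment
print(fragment_count(pages, 500))   # the whole document as one fragment
print(fragment_ranges(pages))
```

A 120-page document gives 5 fragments at the default size, 120 fragments at one page each, and a single fragment at a size of 500. More fragments mean more units competing for CPU cores, which is why smaller fragments need more cores to sustain the same load.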
- Document queues
By default, the maximum queue size is 50 jobs. However, increasing the number of processing cores you are using requires increasing the size of the queue to avoid processes becoming idle. Ideally, the number of jobs should be twice the number of processes. From FineReader Server 14 Release 3 Update 3 onwards, the queue size increases automatically in this case.
- Image preprocessing parameters
Depending on what files are fed to the system, additional image preprocessing stages may either improve or reduce performance. You will need to carry out additional tests to find optimal image preprocessing parameters.
- OCR mode
Selecting the Quality mode will reduce performance by ~45% compared to the Fast mode, but will result in more accurate OCR for documents with complicated layouts and design.
Table 3. Number of pages (in thousands) processed per 24 hours using 16 CPU cores (Intel® Core™ i5-2400 @ 3.10 GHz).
Export/import | European languages, Fast mode | European languages, Quality mode | CJK languages, Fast mode | CJK languages, Quality mode |
410 | 225 | 280 | 156 |
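The ~45% figure can be checked against the Table 3 values with a short calculation. The dictionary layout and the function name `quality_mode_reduction` below are assumptions made for this sketch; the numbers are taken from Table 3.

```python
# Table 3 throughput, in thousands of pages per 24 hours.
TABLE_3 = {
    "European": {"fast": 410, "quality": 225},
    "CJK": {"fast": 280, "quality": 156},
}


def quality_mode_reduction(fast, quality):
    """Percentage drop in throughput when switching from Fast to Quality mode."""
    return round(100 * (fast - quality) / fast, 1)


for group, modes in TABLE_3.items():
    print(group, quality_mode_reduction(modes["fast"], modes["quality"]))
```

This gives a reduction of 45.1% for European languages and 44.3% for CJK languages.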
- OCR language
- Documents written in CJK and Arabic languages take ~30% longer to process than documents written in European languages (see Table 4 for details).
Table 4. Number of pages (in thousands) processed per 24 hours using 16 CPU cores (Intel® Core™ i5-2400 @ 3.10 GHz, Fast mode).
Export/import | Chinese Simplified | Japanese | Korean | Arabic |
236 | 272 | 332 | 286 |
- If an incorrect OCR language is selected, processing may take twice as long. You will see a warning in the status of the job saying that the wrong language has been selected.
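For a per-language view, the Table 4 figures can be compared with the European-language Fast-mode figure from Table 3 (410 thousand pages per 24 hours). The sketch below assumes the two tables were measured under comparable conditions, which the source does not state explicitly; the names used are hypothetical.

```python
# Fast-mode throughput, in thousands of pages per 24 hours.
EUROPEAN_FAST = 410  # from Table 3
TABLE_4 = {
    "Chinese Simplified": 236,
    "Japanese": 272,
    "Korean": 332,
    "Arabic": 286,
}


def throughput_reduction(baseline, value):
    """Percentage by which throughput falls below the baseline."""
    return round(100 * (baseline - value) / baseline, 1)


for language, pages in TABLE_4.items():
    print(language, throughput_reduction(EUROPEAN_FAST, pages))
```

Under that assumption, throughput is lower by 42.4% for Chinese Simplified, 33.7% for Japanese, 19.0% for Korean, and 30.2% for Arabic, so the ~30% figure is best read as an average across these languages.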
- Output format
Export to PDF/A is ~3% slower than export to PDF.
- Programs used for opening files in office formats (either built-in or external)
- External processors can be used for parallel processing, which may significantly improve performance on some documents.
- When processing born-digital documents, avoid running more than 5 processes on one Processing Station, as this may lead to queue overflow and cause some of the jobs to time out.
- Developer logs
Enabling logs may reduce performance by up to 15%.
26.03.2024 13:49:49