Scaling

FlexiCapture can process from several hundred to millions of pages per day and support up to several thousand operators. With the guidelines in this paper, it is easy to estimate the System load in advance and select both the appropriate architecture and hardware for the servers.

The System scales up by:

  • increasing the number of scanning clients, verification clients, and Processing Stations;
  • increasing the power of the machines hosting the Application, Processing, Licensing, and Database servers and the FileStorage, or using several machines for these roles.

The numbers below will help you assess the expected load and select a preliminary configuration of the FlexiCapture server components.

Pages processed in 24 hrs (black-and-white only / grayscale only / color only) | Processing cores | Verification operators | Scanning operators | Configuration
20,000 / 5,000 / 1,000            | 8   | 3   | 3     | Demo
1 million / 500,000 / 300,000     | 80  | 100 | 300   | Medium
3 million / 2 million / 1 million | 120 | 500 | 1,000 | Large (Medium + 10 Gb/s)
Much more                         | —   | —   | —     | xLarge (combination of ABBYY FlexiCapture installations)

Bottleneck monitoring helps you determine when the hardware in use is no longer sufficient for the desired performance and it is time to scale up.

Demo is a typical configuration for demonstrations or pilot projects; it is not recommended for production-scale projects. All the System components are installed on a single virtual machine or PC.

Machine role: ABBYY FlexiCapture

Requirements (1 computer):

  • CPU: 4 cores, 2.4 GHz
  • RAM: 8 GB
  • HDD: 100 GB for the OS and temporary files, plus 100 GB for the Database and FileStorage
  • OS: Windows Server 2012 or later

MS SQL Express may be used as a database server and installed on the same machine as the FlexiCapture servers. Instead of using a separate FileStorage, files can be stored directly in the database. Operator workstations and Processing Stations can also be installed on the same machine.

Note: In commercial projects, the Processing Station should never be installed on a computer hosting the FlexiCapture servers or the Database Server: it consumes all available resources, and server performance deteriorates.

Medium is a typical configuration for commercial projects because it is scalable: each server component is installed on a dedicated machine.

The Application Server should be installed on a dedicated machine, because it employs a scaling-up approach that is different from that of the Database, Processing, and Licensing servers.

Note: Technically, the Application Server, Processing Server, and Licensing Server can be installed on the same computer. Server redundancy will be ensured, but the Application Server’s scalability will not.

  • The Application Server is a web service in IIS; its scaling and reliability are achieved by clustering based on Microsoft Network Load Balancing technology. All cluster nodes are peers running in active-active mode and can be switched off at any time.
  • The Processing Server and Licensing Server are Windows services; their reliability is achieved by creating an active-passive cluster based on Microsoft Failover Cluster technology.

Microsoft explicitly prohibits using these two technologies together on the same computer.

If reliability is all you need, cluster the Application Server within IIS, which also supports clustering via Microsoft Failover Cluster.

Licensing and Processing servers can be installed on the same machine.

We recommend installing the Database Server on a dedicated machine. It is very resource-intensive; if you do combine it with other FlexiCapture servers, restrict its CPU and RAM usage and place the database files on a physically separate HDD, so as not to affect the performance of the neighboring server.

For small loads and better performance, you can use fast HDDs on the Application Server machine as the FileStorage: e.g. 15,000 RPM or faster SATA II disks, arranged at least in RAID 1 for redundancy, or in RAID 10 for better performance as well.

At later stages of the project, however, if the volume of pages to process increases, this configuration will likely become a bottleneck, especially when processing grayscale or color images. Moreover, it cannot be scaled up on the fly: the System will have to go down while additional HDDs are attached.

Use external storage such as NAS or SAN, to which the Application Server has read-write access at 1 Gb/s over LAN, SCSI, Fibre Channel, etc. This enables smooth scaling of the FileStorage.

The FileStorage entry below explains how to calculate the required performance of the FileStorage hardware.

A typical FlexiCapture network configuration in an enterprise environment:

Note: It is recommended to connect the Application Server directly to the FileStorage and the Database Server for fast and reliable communication.

Machine role: Application Server

Requirements:

  • CPU: 8 physical cores, 2.4 GHz or faster
  • RAM: 16 GB
  • HDD: 100 GB
  • NICs: 2 x 1 Gb/s, one to connect to the LAN and one to connect to the Database Server
  • FileStorage: if a SAN is used, connect it via SCSI, Fibre Channel, or InfiniBand
  • OS: Windows Server 2012 or later

A web service and the hub of all FlexiCapture communications, the Application Server is responsible for both:

  • transferring large binary bodies; and
  • fast responses to small SOAP/JSON service requests.

Critical resources are:

  1. A fast network interface for connecting to clients.
  2. A fast, stable connection to the FileStorage and the Database Server.
  3. A multi-core CPU with a high clock speed:
    • The higher the clock speed, the faster each request is handled.
    • The more physical cores there are, the more requests can be handled at the same time.
  4. Sufficient RAM: at least 2 GB per physical core.

To make the most of the CPU, set the number of IIS Worker Processes for the FlexiCapture Web Services Application pool to twice the number of physical cores, e.g. 16 IIS Worker Processes for an 8-core processor. Both rules of thumb are illustrated in the sketch below.
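
A minimal Python sketch of the sizing rules above (the function name and output layout are our own illustration, not part of the product):

```python
def app_server_sizing(physical_cores: int) -> dict:
    """Rule-of-thumb Application Server sizing from the guidelines above."""
    return {
        # Twice as many IIS Worker Processes as physical cores.
        "iis_worker_processes": 2 * physical_cores,
        # At least 2 GB of RAM per physical core.
        "min_ram_gb": 2 * physical_cores,
    }

print(app_server_sizing(8))
# -> {'iis_worker_processes': 16, 'min_ram_gb': 16}
```

Note that for the recommended 8-core machine this yields 16 Worker Processes and 16 GB of RAM, matching the requirements listed above.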

If any of these resources causes a bottleneck, scale up the Application Server:

  • via Microsoft Network Load Balancing technology, which clusters multiple computers with the Application Server role. See detailed instructions in the FlexiCapture System Administrator's Guide.
  • at the hardware level, by connecting different ranges of clients to different machines with the Application Server role. E.g. one machine can serve all automatic processing while another is exposed to external clients.

In any case, all machines with the Application Server role should be connected to the same Database and FileStorage.

Machine role: Processing Server, Licensing Server

Requirements:

  • CPU: 4 cores, 2.4 GHz or faster
  • RAM: 8 GB
  • HDD: 100 GB
  • NIC: 1 Gb/s, for connecting to the LAN
  • OS: Windows Server 2012 or later

A stable network connection is essential for these servers; otherwise, document processing will stop. To ensure redundancy, use Microsoft Failover Cluster. See detailed instructions in the FlexiCapture System Administrator's Guide.

The Licensing Server keeps copies of the licenses for all concurrent clients in its memory. Keep this in mind if a large number of scanning and verification operators will work simultaneously. We also recommend using the 64-bit version in projects involving a large number of concurrent clients. Our tests have shown that 2 GB of RAM is enough to handle licenses for up to 1,000 clients. Consider using more than one Licensing Server to serve more concurrent clients simultaneously.
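
As a rough back-of-the-envelope Python sketch (assuming, without vendor confirmation, that the tested 2 GB per 1,000 clients figure scales linearly):

```python
import math

def licensing_ram_gb(concurrent_clients: int) -> int:
    """Rough Licensing Server RAM estimate.

    ASSUMPTION: the tested figure of 2 GB per 1,000 concurrent clients
    extrapolates linearly; treat larger results as a starting point for
    load testing, not a guarantee.
    """
    return 2 * max(1, math.ceil(concurrent_clients / 1000))

print(licensing_ram_gb(800))   # -> 2 (within the tested range)
print(licensing_ram_gb(2500))  # -> 6 (extrapolated)
```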

Machine role: Database Server

Requirements for MS SQL Server:

  • Database: MS SQL Server 2014 or later, Standard or Enterprise Edition
  • CPU: 8 physical cores, 3.4 GHz or faster
  • RAM: 16 GB or more
  • HDD: 400 GB
  • OS: Windows Server 2012 or later

Requirements for Oracle:

  • Database: Oracle 12c Enterprise Edition
  • Hardware: Oracle Exadata Database Machine X2-2, Quarter Rack

ABBYY FlexiCapture supports MS SQL Server and Oracle installed on any platform. Both database vendors provide their own documentation on optimal settings, scaling, and fault tolerance.

Recommended for MS SQL Server:

  • Install as much RAM as possible on the Database Server machine, so that the greater part of the database files is held in RAM and accessed faster;
  • Use a fast HDD for quick access to the part of the database hosted on disk (we recommend an SSD for this purpose);
  • Avoid database modes with transaction delays (Mirroring, etc.);
  • Choose Simple as the database Recovery Model;
  • Store the database and its log on separate disks;
  • Update indexes regularly for frequently changing tables (Document, Page, Batch, Task, and EventLog); otherwise, the size of an index may grow larger than the data in the table.

Machine role: FileStorage

Requirements:

  • NAS or SAN, connected via LAN, SCSI, Fibre Channel, or InfiniBand
  • Read-write speed: 100 MB/s*
  • Capacity: 5 TB*

*The read-write speed and capacity requirements depend greatly on two factors:

1. The average and peak number of pages processed per day (i.e. per 24 hours) and per hour, and their color mode. As mentioned in the Performance Metrics section, the input flow in bytes per second can be estimated from typical file sizes for pages scanned in color, grayscale, and black-and-white.

Images make up the majority of the data transferred within the System. By analyzing the processing workflow, define two values:

  • the number R of stages where images of pages are downloaded from the Application Server;
  • the number W of stages where images of pages are uploaded to the Application Server.

The read-write speed requirements can be calculated as follows:

  • Required write speed = W x input flow in bytes per second.
  • Required read speed = R x input flow in bytes per second.

Example. A customer needs to process 10,000 grayscale pages per hour. The processing workflow includes four stages:

  1. A Processing Station downloads images from a hot folder, pre-recognizes them, and uploads them to the Application Server (W=1, R=0).
  2. Another Processing Station retrieves these images from the Application Server, performs recognition, and the OCR results arrive at the Application Server (W=1, R=1).
  3. A verification operator downloads images and recognized data for checking and sends the verified data back to the Server (W=1, R=2).
  4. Finally, a Processing Station downloads images and verified data to send them to the customer's back-end system (W=1, R=3).

Assuming that the file size of an average A4 grayscale scan is 3 MB, we have the following calculations:

Input flow = 10,000 grayscale page images/hour ≈ 2.8 grayscale images/s ≈ 8.4 MB/s.

Required write speed = 1 x 8.4 MB/s = 8.4 MB/s.

Required read speed = 3 x 8.4 MB/s = 25.2 MB/s.
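
The same calculation expressed as a small Python sketch (the function name is our own; the slight difference from the figures above comes from rounding the input flow to 2.8 images/s before multiplying):

```python
def filestorage_speeds(pages_per_hour: float, mb_per_page: float,
                       r: int, w: int) -> tuple:
    """Return (required_read_MBps, required_write_MBps) for the FileStorage."""
    input_flow = pages_per_hour / 3600 * mb_per_page  # MB/s entering the System
    return r * input_flow, w * input_flow

# Worked example: 10,000 grayscale pages/hour at 3 MB per page, R=3, W=1.
read_mbps, write_mbps = filestorage_speeds(10_000, 3.0, r=3, w=1)
print(f"read: {read_mbps:.1f} MB/s, write: {write_mbps:.1f} MB/s")
# -> read: 25.0 MB/s, write: 8.3 MB/s
# (the text above rounds the flow to 8.4 MB/s first, giving 25.2 MB/s)
```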

To benchmark the performance of the hard disk, you can use the CrystalDiskMark tool, distributed under the MIT license.

2. The amount of time that documents are stored in the System.

Example. A customer needs to process 100,000 grayscale images in 24 hours. Under the Service Level Agreement, processing time is 2 days per document. Processed documents are stored for 2 weeks because of additional checks in the customer's ERP system; in the event of any discrepancies, documents are edited in FlexiCapture and uploaded to the ERP system again.

Thus, images are stored for 2 + 14 = 16 days, and the System will accumulate 16 days x 100,000 grayscale images x 3 MB (the average file size of an A4 grayscale image) = 4.8 TB of data.
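
The same capacity estimate as a short Python sketch (the function name is our own; decimal units, 1 TB = 1,000,000 MB, are assumed):

```python
def filestorage_capacity_tb(pages_per_day: int, mb_per_page: float,
                            sla_days: int, retention_days: int) -> float:
    """Estimate accumulated FileStorage data in TB (1 TB = 1,000,000 MB)."""
    days_stored = sla_days + retention_days
    return days_stored * pages_per_day * mb_per_page / 1_000_000

# Worked example: 100,000 pages/day, 3 MB each, 2-day SLA + 14-day retention.
print(filestorage_capacity_tb(100_000, 3.0, sla_days=2, retention_days=14))
# -> 4.8
```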

Note: We strongly recommend using a fault-tolerant storage technology, e.g. RAID 10. Search indexing and anti-virus scanning of FileStorage contents may degrade performance or block access to files that are being processed by the System.

The Large configuration is required when you deal with a significant volume of color pages (more than 300,000 per 24 hours). Its declared range is up to 3 million black-and-white pages or up to 1 million color pages in 24 hours.

Everything said above about the Medium configuration remains valid for the Large configuration. The difference is that you must follow all of the optimization recommendations and pay special attention to each part of the system: calculate the load and choose hardware that is sufficiently powerful yet not too expensive. Among other things, test the Internet connection and the back-end connector to ensure they can operate at the desired performance level.

From the very beginning, consider using a 10 Gb/s network and a powerful FileStorage. A possible network architecture for the Large configuration is shown below.

Instead of providing typical system requirements for the Large configuration, we recommend reviewing the tested configurations and their performance figures provided in this document.

To achieve even better performance, combine several independent FlexiCapture installations under one Administration and Monitoring point. This is referred to as the xLarge configuration and is beyond the scope of this document.
