Document Classification

The purpose of document classification is to assign documents to predefined categories. This is useful when a document flow includes documents of various types and you need to identify the type of each document. For example, you may want to sort contracts, invoices, and receipts into different folders, or rename them according to their types. This can be done automatically with a pretrained system.

One of the main features of document classification is that you know in advance the types of documents you need to distinguish between. ABBYY FineReader Engine can classify documents on the basis of their content, on the features of the image, or on both the characteristics of the recognized text and the image.

Let us consider the process in detail. It consists of two main steps:

  1. Creating a classification database

For each category, choose several typical documents or pages. They will be used to create the classification database.

  2. Classifying documents

The database that was created in the previous step can be used to classify documents. Incoming documents are fed to a pre-trained classification system which uses the classification database to determine the category.
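The two steps above can be illustrated with a small, self-contained sketch. It does not use the ABBYY FineReader Engine API and does not reflect the Engine's actual classification algorithm; it swaps in a simple bag-of-words comparison purely to show the workflow of building a database from sample documents and then using it to classify new ones. All function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_database(samples):
    """Step 1: aggregate word counts per category from sample texts.

    `samples` maps a category name to a list of typical document texts.
    """
    return {
        category: Counter(word for text in texts for word in tokenize(text))
        for category, texts in samples.items()
    }


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(database, text):
    """Step 2: return the category whose word profile is most similar."""
    words = Counter(tokenize(text))
    return max(database, key=lambda category: cosine(words, database[category]))


# Illustrative sample texts only; real input would be recognized documents.
samples = {
    "invoice": [
        "Invoice number 1001, amount due within 30 days",
        "Invoice total amount payable",
    ],
    "contract": [
        "This agreement is made between the parties",
        "The parties agree to the terms of this contract",
    ],
}
database = build_database(samples)
print(classify(database, "Please pay the invoice amount"))  # prints "invoice"
```

The point of the sketch is the separation of concerns: the database is built once from a few typical documents per category, and can then be reused for every incoming document.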

You may also need to classify documents according to some of their attributes, such as author or barcode value. This article is not concerned with that type of classification. If you want to classify documents according to their attributes, you should implement your own algorithms, which can use the text extraction, field-level recognition, or barcode recognition scenarios for data extraction.

The procedure described below is also illustrated by a Classification code sample.

Implementing the scenario

Below follows a detailed description of the recommended method of using ABBYY FineReader Engine to classify documents.

Step 1. Loading ABBYY FineReader Engine

Step 2. Creating ClassificationEngine

Step 3. Preparing the classification objects

Step 4. Creating a training data set

Step 5. Training the classification model

Step 6. Classifying documents

Step 7. Unloading ABBYY FineReader Engine

Required resources

You can use the FREngineDistribution.csv file to automatically create a list of files required for your application to function. For processing with this scenario, select the following values in column 5 (RequiredByModule):

Core

Core.Resources

Opening

Opening, Processing

Processing

Processing.Classification

Processing.Classification.NaturalLanguages

Processing.OCR

Processing.OCR, Processing.ICR

Processing.OCR.NaturalLanguages

Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages

If you modify the standard scenario, change the required modules accordingly. You also need to specify the interface languages, recognition languages, and any additional features that your application uses (for example, Opening.PDF if you need to open PDF files, or Processing.OCR.CJK if you need to recognize texts in CJK languages). See Working with the FREngineDistribution.csv File for further details.
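The selection described above could be automated with a short script. The following is a minimal sketch, assuming the file is delimited text in which the fifth column holds the RequiredByModule value; the semicolon delimiter, the meaning of the other columns, and the sample rows are assumptions made for illustration and are not taken from the actual file. Check them against Working with the FREngineDistribution.csv File before relying on this.

```python
import csv
import io

# Values of column 5 (RequiredByModule) listed for this scenario.
SCENARIO_MODULES = {
    "Core",
    "Core.Resources",
    "Opening",
    "Opening, Processing",
    "Processing",
    "Processing.Classification",
    "Processing.Classification.NaturalLanguages",
    "Processing.OCR",
    "Processing.OCR, Processing.ICR",
    "Processing.OCR.NaturalLanguages",
    "Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages",
}


def select_rows(stream, modules, delimiter=";"):
    """Return the rows whose fifth column exactly matches one of `modules`.

    The delimiter is an assumption; pass a different one if needed.
    """
    reader = csv.reader(stream, delimiter=delimiter)
    return [
        row for row in reader
        if len(row) >= 5 and row[4].strip() in modules
    ]


# Made-up rows for demonstration only; not real contents of the file.
sample = (
    "a.dll;x;y;z;Core\n"
    "b.dll;x;y;z;Processing.OCR.CJK\n"
    "c.dll;x;y;z;Opening, Processing\n"
)
selected = select_rows(io.StringIO(sample), SCENARIO_MODULES)
print([row[0] for row in selected])  # prints ['a.dll', 'c.dll']
```

To use it on the real file, open FREngineDistribution.csv with `open(path, newline="")` and pass the file object as `stream`. If your application uses additional features, add their module names to the set.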

Additional optimization

You can find more information about setting up the various processing stages in these articles:

  • Opening and preprocessing images
    • Image Preprocessing
      Describes a scenario where ABBYY FineReader Engine is used to preprocess images.

See also

Basic Usage Scenarios Implementation

03.07.2024 8:50:10
