The purpose of document classification is to assign documents to predefined categories. This is useful when you are processing a document flow that includes documents of various types and need to identify the type of each document. For example, you may want to sort contracts, invoices, and receipts into different folders, or rename them according to their types. This can be done automatically with a pretrained system.
One of the main features of document classification is that you know in advance which types of documents you need to distinguish between. ABBYY FineReader Engine can classify documents based on the text content, on the features of the image, or on a combination of the recognized text and the image characteristics.
Let us consider the process in detail. It consists of two main steps:
- Creating a classification database
For each category, choose several typical documents or pages. They will be used to create the classification database.
- Classifying documents
The database created in the previous step can then be used to classify documents. Incoming documents are fed to a pretrained classification system, which uses the classification database to determine the category.
You may also need to classify documents by some of their attributes, such as the author or a barcode value. This article does not cover that type of classification. To classify documents by their attributes, implement your own algorithms, which can use text extraction, field-level recognition, or barcode recognition for data extraction.
The procedure described below is also illustrated by a Classification code sample.
Implementing the scenario
Below is a detailed description of the recommended way to use ABBYY FineReader Engine to classify documents.
Step 1. Loading ABBYY FineReader Engine
To start your work with ABBYY FineReader Engine, you need to create the Engine object. The Engine object is the top object in the hierarchy of the ABBYY FineReader Engine objects and provides various global settings, some processing methods, and methods for creating the other objects.
To create the Engine object, you can use the InitializeEngine function.
Step 2. Creating ClassificationEngine
Create the ClassificationEngine object, which provides access to the classification functionality and is used to create the other classification objects described in the steps below.
Step 3. Preparing the classification objects
The training and classification methods work with a special kind of object created from a document or page: the ClassificationObject, which contains all classification-relevant information.
To prepare a document for use in the classification scenario, do the following:
- Load the images for processing. There are several ways to do it: for example, you may create the FRDocument object with the help of the CreateFRDocument method of the Engine object, then add images to it from files using the AddImageFile method.
- If you are going to train or use a classifier of a type that takes text features into account (CT_Combined, CT_Text), first recognize the document using any convenient method. We will use the Analyze and Recognize methods of the FRDocument object. Document synthesis is not necessary for classification.
Although parallel processing is not supported for classification itself, you may need it for preparatory recognition of the documents. If the number of documents you are going to classify is large, we recommend using Batch Processor or other parallel processing methods described in Parallel Processing with ABBYY FineReader Engine.
- Use the CreateObjectFromDocument method of the ClassificationEngine object to create a ClassificationObject containing the information from the first page of the document. If you need to use another page of the document, call the CreateObjectFromPage method.
- The Description property of the ClassificationObject is empty by default. Specify this property if you need a relevant description.
Step 4. Creating a training data set
To train a classifier that would distinguish between several types of documents, you need a categorized data set that contains samples of each type. Use the TrainingData object to populate and manage this data set:
- Create an empty object with the help of the CreateTrainingData method of the ClassificationEngine object.
- Access the collection of categories via its Categories property.
- Use the AddNew method of the Categories object several times to add a category for each of the document types you intend to classify. The method requires a string with the category label as the input parameter. The label will be returned by the classification methods, so it must be unique in the categories set.
- For each newly-added Category object, open the collection of classification objects using the Objects property. With the help of the IClassificationObjects::Add method, add the classification objects which correspond to this category.
No category may be left empty. Since a classifier must distinguish between at least two classes, at least two categories are required for training.
- Now that you have configured the training data set, you may wish to save it into a file on disk for later use: for example, if the trained model accuracy proves unacceptable and you wish to add or correct some data for better quality. The TrainingData object provides the SaveToFile method.
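Before saving or training on the data set, it can be worth checking the constraints described above on the application side. The sketch below is a minimal illustration of such a pre-flight check; the (label, samples) data structure and the validate_training_set function are illustrative only, not part of the FRE API.

```python
def validate_training_set(categories):
    """categories: a list of (label, [sample, ...]) pairs.
    Returns a list of problems; an empty list means the set looks valid."""
    errors = []
    labels = [label for label, _ in categories]
    # Category labels are returned by the classification methods,
    # so they must be unique within the set.
    if len(labels) != len(set(labels)):
        errors.append("category labels must be unique")
    # At least two categories are required for training.
    if len(categories) < 2:
        errors.append("at least two categories are required")
    # No category may be left empty.
    for label, samples in categories:
        if not samples:
            errors.append(f"category '{label}' has no samples")
    return errors

print(validate_training_set([("invoice", ["inv1.tif"]), ("contract", [])]))
# → ["category 'contract' has no samples"]
```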
Step 5. Training the classification model
The functionality for model training is provided by the Trainer object. Use the CreateTrainer method of the ClassificationEngine object to create it.
The Trainer object contains all the settings for the classifier type and the training procedure in two subobjects, TrainingParams and ValidationParams. Decide which settings you need and change the corresponding properties:
- The type of classifier (ITrainingParams::ClassifierType). This setting determines which features of the document are taken into account when assigning a category: image characteristics, contents of the recognized text, or both. To select a type which uses the text contents, you need to make sure all the classification objects in the training data set have been created from previously recognized documents.
- The training mode (ITrainingParams::TrainingMode). This setting determines if the training process should favor high precision (how many of the selected elements are correct), high recall (how many of the correct elements are selected), or balance between the two.
- If k-fold cross-validation should be used (IValidationParams::ShouldPerformValidation). We recommend using cross-validation when your training sample is not large, as it allows you to train several models on different partitions of the same sample and select the best one. If you have a large supply of categorized data, it may be better to turn validation off, train the model on the whole training sample, and then use the classification methods (Step 6) to test the model on a separate sample, calculating the performance scores on your side.
- The parameters of k-fold cross-validation: the number of parts into which the training sample is divided (IValidationParams::FoldsCount) and the number of iterations (IValidationParams::RepeatCount). Note that each iteration requires at least 4 objects in the training set for the text classifier and at least 8 for the combined classifier. Make sure that your training sample contains enough objects.
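A quick sanity check for these settings can be done on the application side. The sketch below assumes the standard k-fold scheme, in which each iteration holds out one of the FoldsCount parts and trains on the rest; the minimums (4 objects for the text classifier, 8 for the combined classifier) come from the note above, and the function itself is an illustration, not part of the FRE API.

```python
# Minimum objects per training iteration, per classifier type (see above).
MIN_OBJECTS = {"text": 4, "combined": 8}

def folds_are_feasible(sample_size, folds_count, classifier_type):
    """Check that enough objects remain for training on each iteration."""
    held_out = -(-sample_size // folds_count)      # ceiling division
    training_portion = sample_size - held_out      # objects used to train
    return training_portion >= MIN_OBJECTS[classifier_type]

print(folds_are_feasible(10, 5, "combined"))  # 10 - 2 = 8 → True
print(folds_are_feasible(9, 5, "combined"))   # 9 - 2 = 7 → False
```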
Now you are ready to train a model. Pass the TrainingData object you configured in Step 4 to the TrainModel method of the Trainer object. It returns a TrainingResults collection which, with the currently available functionality, contains only one TrainingResult. If you chose to perform cross-validation, review the performance scores in its ValidationResult subobject.
The ITrainingResult::Model property provides access to the trained classification model. You may save it into a file with the help of the SaveToFile method or use it directly to classify some documents (proceed to Step 6).
Step 6. Classifying documents
To use the trained model for classification:
- If the model is not currently loaded, call the CreateModelFromFile method of the ClassificationEngine object to load it from a file on disk.
- Prepare the classification objects from the documents you need to classify, as described in Step 3.
- For each classification object, call the Classify method of the Model object with the ClassificationObject as the input parameter. The method returns a collection of ClassificationResult objects, each containing the category label and the probability for this category. The results are sorted by probability from best to worst. Retrieve the result and check that the probability level is acceptable to you.
If the classifier was unable to assign a category, null is returned instead of the results collection.
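Handling of the returned results might look like the sketch below, which assumes each result is exposed as a (category label, probability) pair sorted from best to worst, as described above. The threshold value and the routing of unclassified or low-confidence documents are application-side choices, not part of the API.

```python
def pick_category(results, min_probability=0.85):
    """Return the best category label, or None to route the document
    to manual review."""
    if not results:                             # classifier assigned no category
        return None
    best_label, best_probability = results[0]   # results are sorted best-first
    if best_probability < min_probability:      # best guess is too uncertain
        return None
    return best_label

print(pick_category([("invoice", 0.93), ("receipt", 0.05)]))  # → invoice
print(pick_category([("invoice", 0.51), ("receipt", 0.47)]))  # → None
print(pick_category(None))                                    # → None
```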
Step 7. Unloading ABBYY FineReader Engine
After finishing your work with ABBYY FineReader Engine, you need to unload the Engine object. To do this, use the DeinitializeEngine exported function.
Required resources
You can use the FREngineDistribution.csv file to automatically create a list of files required for your application to function. For processing with this scenario, select the following values in column 5 (RequiredByModule):
Core
Core.Resources
Opening
Opening, Processing
Processing
Processing.Classification
Processing.Classification.NaturalLanguages
Processing.OCR
Processing.OCR, Processing.ICR
Processing.OCR.NaturalLanguages
Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages
If you modify the standard scenario, change the required modules accordingly. You also need to specify the interface languages, the recognition languages, and any additional features your application uses (such as Opening.PDF if you need to open PDF files, or Processing.OCR.CJK if you need to recognize texts in CJK languages). See Working with the FREngineDistribution.csv File for further details.
Additional optimization
You can find more information about setting up the various processing stages in these articles:
- Opening and preprocessing images
- Image Preprocessing
Describes a scenario where ABBYY FineReader Engine is used to preprocess images.
- Recognition
See also
Basic Usage Scenarios Implementation