Document Classification

The purpose of document classification is to assign documents to different predefined categories. This is very useful when dealing with a document flow that includes documents of various types, and you need to identify the type of each document. For example, you may want to sort the contracts, invoices, and receipts into different folders, or rename them according to their types. This can be done automatically with a pretrained system.
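As an illustration of the sorting use case mentioned above, the following minimal sketch moves a file into a folder named after its category. It assumes the category label has already been obtained from a classifier; `DocumentSorter`, `MoveToCategoryFolder`, and the folder layout are hypothetical names chosen for this example and are not part of ABBYY FineReader Engine.

```csharp
using System;
using System.IO;

public static class DocumentSorter
{
    // Moves a file into a subfolder of outputRoot named after its category label
    // and returns the new full path. The label is assumed to be a valid folder name.
    public static string MoveToCategoryFolder(string filePath, string categoryLabel, string outputRoot)
    {
        string targetFolder = Path.Combine(outputRoot, categoryLabel);
        // CreateDirectory does nothing if the folder already exists
        Directory.CreateDirectory(targetFolder);
        string targetPath = Path.Combine(targetFolder, Path.GetFileName(filePath));
        // File.Move throws if a file with the same name is already in the target folder
        File.Move(filePath, targetPath);
        return targetPath;
    }
}
```

A call such as `DocumentSorter.MoveToCategoryFolder(path, "Invoice", outputRoot)` would place the file under an `Invoice` subfolder of `outputRoot`. Labels that contain characters not allowed in folder names would need to be sanitized first.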

One of the main features of document classification is that you know in advance the types of documents you need to distinguish between. ABBYY FineReader Engine can classify documents on the basis of the recognized text, the features of the image, or both.

Let us consider the process in detail. It consists of two main steps:

Creating a classification database

For each category, choose several typical documents or pages. They will be used to create the classification database.

Classifying documents

The database that was created in the previous step can be used to classify documents. Incoming documents are fed to a pre-trained classification system which uses the classification database to determine the category.

You may also need to classify documents according to some of their attributes, such as author or barcode value. This article does not cover that type of classification. If you want to classify documents according to their attributes, you should implement your own algorithms, which can use the text extraction, field-level recognition, or barcode recognition scenarios for data extraction.

The procedure described below is also illustrated by the Classification demo tool.

Implementing the scenario

Below follows a detailed description of the recommended method of using ABBYY FineReader Engine to classify documents.

Step 1. Loading ABBYY FineReader Engine

To start your work with ABBYY FineReader Engine, you need to create the Engine object. The Engine object is the top object in the hierarchy of the ABBYY FineReader Engine objects and provides various global settings, some processing methods, and methods for creating the other objects.

To create the Engine object, you can use the InitializeEngine function. See also the other ways to load the Engine object.

C#

using System;
using System.Runtime.InteropServices;

public class EngineLoader : IDisposable
{
    public EngineLoader()
    {
        // Initialize these variables with the full path to FREngine.dll, your Customer Project ID,
        // and, if applicable, the path to your Online License token file and the Online License password
        string enginePath = "";
        string customerProjectId = "";
        string licensePath = "";
        string licensePassword = "";
        // Load the FREngine.dll library
        dllHandle = LoadLibraryEx(enginePath, IntPtr.Zero, LOAD_WITH_ALTERED_SEARCH_PATH);
           
        try
        {
            if (dllHandle == IntPtr.Zero)
            {
                throw new Exception("Can't load " + enginePath);
            }
            IntPtr initializeEnginePtr = GetProcAddress(dllHandle, "InitializeEngine");
            if (initializeEnginePtr == IntPtr.Zero)
            {
                throw new Exception("Can't find InitializeEngine function");
            }
            IntPtr deinitializeEnginePtr = GetProcAddress(dllHandle, "DeinitializeEngine");
            if (deinitializeEnginePtr == IntPtr.Zero)
            {
                throw new Exception("Can't find DeinitializeEngine function");
            }
            IntPtr dllCanUnloadNowPtr = GetProcAddress(dllHandle, "DllCanUnloadNow");
            if (dllCanUnloadNowPtr == IntPtr.Zero)
            {
                throw new Exception("Can't find DllCanUnloadNow function");
            }
            // Convert pointers to delegates
            initializeEngine = (InitializeEngine)Marshal.GetDelegateForFunctionPointer(
                initializeEnginePtr, typeof(InitializeEngine));
            deinitializeEngine = (DeinitializeEngine)Marshal.GetDelegateForFunctionPointer(
                deinitializeEnginePtr, typeof(DeinitializeEngine));
            dllCanUnloadNow = (DllCanUnloadNow)Marshal.GetDelegateForFunctionPointer(
                dllCanUnloadNowPtr, typeof(DllCanUnloadNow));
            // Call the InitializeEngine function 
            // passing the path to the Online License file and the Online License password
            int hresult = initializeEngine(customerProjectId, licensePath, licensePassword, 
                "", "", false, ref engine);
            Marshal.ThrowExceptionForHR(hresult);
        }
        catch (Exception)
        {
            // Free the FREngine.dll library
            engine = null;
            // Deleting all objects before FreeLibrary call
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();
            FreeLibrary(dllHandle);
            dllHandle = IntPtr.Zero;
            initializeEngine = null;
            deinitializeEngine = null;
            dllCanUnloadNow = null;
            throw;
        }
    }
    // Kernel32.dll functions
    [DllImport("kernel32.dll")]
    private static extern IntPtr LoadLibraryEx(string dllToLoad, IntPtr reserved, uint flags);
    private const uint LOAD_WITH_ALTERED_SEARCH_PATH = 0x00000008;
    [DllImport("kernel32.dll")]
    private static extern IntPtr GetProcAddress(IntPtr hModule, string procedureName);
    [DllImport("kernel32.dll")]
    private static extern bool FreeLibrary(IntPtr hModule);
    // FREngine.dll functions
    [UnmanagedFunctionPointer(CallingConvention.StdCall, CharSet = CharSet.Unicode)]
    private delegate int InitializeEngine(string customerProjectId, string licensePath, 
        string licensePassword, string tempFolder, string dataFolder, bool isSharedCPUCoresMode, 
        ref FREngine.IEngine engine);
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DeinitializeEngine();
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DllCanUnloadNow();
    // private variables
    private FREngine.IEngine engine = null;
    // Handle to FREngine.dll
    private IntPtr dllHandle = IntPtr.Zero;
    private InitializeEngine initializeEngine = null;
    private DeinitializeEngine deinitializeEngine = null;
    private DllCanUnloadNow dllCanUnloadNow = null;
}

Step 2. Creating ClassificationEngine

Step 3. Preparing the classification objects

The training and classification methods work with a special kind of object created from a document or page: ClassificationObject, which contains all classification-relevant information.

To prepare a document for use in the classification scenario, do the following:

Load the images for processing. There are several ways to do it: for example, you may create the FRDocument object with the help of the CreateFRDocument method of the Engine object, then add images to the created FRDocument object from file using the AddImageFile method.
If you are going to train or use a classifier of the type which takes into account text features (CT_Combined, CT_Text), first recognize the document with the help of any convenient method. We will use the Analyze and Recognize methods of the FRDocument object. Document synthesis is not necessary for classification.

Although parallel processing is not supported for classification itself, you may need it for preparatory recognition of the documents. If the number of documents you are going to classify is large, we recommend using Batch Processor or other parallel processing methods described in Parallel Processing with ABBYY FineReader Engine.

Use the CreateObjectFromDocument method of the ClassificationEngine object to create a ClassificationObject containing the information from the first page of the document. If you need to use another of the document's pages, call the CreateObjectFromPage method.
The Description property of the ClassificationObject is empty by default. Specify this property if you need a relevant description.

Note: A recognized document or page may sometimes contain no recognized text (for example, if an empty page was used by mistake). In this case, the ClassificationObject cannot be used with classifiers that require text features. You can check its SuitableClassifiers property to make sure.

C#

// Create the FRDocument object
FREngine.IFRDocument frDocument = engine.CreateFRDocument();
// Add the images
frDocument.AddImageFile( "C:\\MyImage.tif", null, null );
// Optional: analyze and recognize the document
frDocument.Analyze( null, null, null );
frDocument.Recognize( null, null );
// Create the classification object
FREngine.IClassificationObject clObject = classEngine.CreateObjectFromDocument( frDocument );
// Let's put the category to which the object belongs into its description
clObject.Description = "CategoryA_Object1";

Step 4. Creating a training data set

To train a classifier that would distinguish between several types of documents, you need a categorized data set that contains samples of each type. Use the TrainingData object to populate and manage this data set:

Create an empty object with the help of the CreateTrainingData method of the ClassificationEngine object.
Access the collection of categories via its Categories property.
Use the AddNew method of the Categories object several times to add a category for each of the document types you intend to classify. The method requires a string with the category label as the input parameter. The label will be returned by the classification methods, so it must be unique in the categories set.
For each newly-added Category object, open the collection of classification objects using the Objects property. With the help of the IClassificationObjects::Add method, add the classification objects which correspond to this category.
No category may be left empty. For obvious reasons, at least two categories are required for training.
Now that you have configured the training data set, you may wish to save it into a file on disk for later use: for example, if the trained model accuracy proves unacceptable and you wish to add or correct some data for better quality. The TrainingData object provides the SaveToFile method.
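The constraints listed above (at least two categories, no empty category, unique labels) can be checked before any Engine objects are created. The sketch below assumes the samples are first staged in an ordinary dictionary that maps a category label to its sample file paths; `TrainingSetChecks` and this staging structure are hypothetical and not part of the FineReader Engine API. Using the label as the dictionary key makes duplicate labels impossible by construction.

```csharp
using System;
using System.Collections.Generic;

public static class TrainingSetChecks
{
    // Returns a list of problems found in the staged training set.
    // An empty list means the set satisfies the constraints described above.
    public static List<string> Validate(IDictionary<string, List<string>> samplesByCategory)
    {
        var problems = new List<string>();
        if (samplesByCategory.Count < 2)
        {
            problems.Add("At least two categories are required for training.");
        }
        foreach (KeyValuePair<string, List<string>> pair in samplesByCategory)
        {
            if (pair.Value == null || pair.Value.Count == 0)
            {
                problems.Add("Category '" + pair.Key + "' has no samples.");
            }
        }
        return problems;
    }
}
```

If `Validate` returns an empty list, the staged samples can be turned into classification objects and added to the corresponding Category objects as described in the steps above.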

C#

FREngine.ITrainingData trainingData = classEngine.CreateTrainingData();
FREngine.ICategories categories = trainingData.Categories;
// Add the first category
FREngine.ICategory category = categories.AddNew( "CategoryA" );
// Add the classification objects prepared in Step 3
category.Objects.Add( clObject ); // repeat for all objects from this category
// ...
// Repeat for all categories
// ...
// After all categories have been added, save the training data set
trainingData.SaveToFile( "C:\\trainingData.dat" );

Step 5. Training the classification model

The functionality for model training is provided by the Trainer object. Use the CreateTrainer method of the ClassificationEngine object to create it.

The Trainer object holds all settings for the classifier type and the training procedure in two subobjects, TrainingParams and ValidationParams. Decide which settings you need and change the corresponding properties:

The type of classifier (ITrainingParams::ClassifierType). This setting determines which features of the document are taken into account when assigning a category: image characteristics, contents of the recognized text, or both. To select a type which uses the text contents, you need to make sure all the classification objects in the training data set have been created from previously recognized documents.
The training mode (ITrainingParams::TrainingMode). This setting determines if the training process should favor high precision (how many of the selected elements are correct), high recall (how many of the correct elements are selected), or balance between the two.
Whether k-fold cross-validation should be used (IValidationParams::ShouldPerformValidation). We recommend using cross-validation when your training sample is not large, as it allows you to train several models on different partitions of the same sample and select the best one. If you have a large supply of categorized data, it may be best to turn the validation off, train the model on the whole training sample, then use the classification methods (Step 6) to test the model on another sample, calculating the performance scores on your side.
The parameters of k-fold cross-validation: the number of parts into which the training sample is divided (IValidationParams::FoldsCount) and the number of iterations (IValidationParams::RepeatCount). Note that the training set on each iteration must contain at least 4 objects for the text classifier and at least 8 objects for the combined classifier. Make sure that your training sample contains enough objects.
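To estimate whether a sample is large enough for a given number of folds, the arithmetic can be sketched as follows. This assumes standard k-fold partitioning, in which each iteration trains on all objects except one fold; how FineReader Engine actually partitions the sample, and whether the minimum applies to the whole set or to each category, is not stated here, so treat the result as a rough check only. `FoldSizeCheck` is a hypothetical helper, not part of the API.

```csharp
using System;

public static class FoldSizeCheck
{
    // Size of the smallest training subset in standard k-fold cross-validation:
    // all objects except the largest fold, which is held out for validation.
    public static int MinTrainingObjectsPerIteration(int totalObjects, int foldsCount)
    {
        if (foldsCount < 2)
        {
            throw new ArgumentOutOfRangeException(nameof(foldsCount), "At least two folds are required.");
        }
        if (totalObjects < foldsCount)
        {
            throw new ArgumentException("There must be at least one object per fold.");
        }
        // Ceiling division gives the size of the largest fold
        int largestFold = (totalObjects + foldsCount - 1) / foldsCount;
        return totalObjects - largestFold;
    }

    // requiredMinimum is 4 for the text classifier and 8 for the combined classifier
    public static bool HasEnoughObjects(int totalObjects, int foldsCount, int requiredMinimum)
    {
        return MinTrainingObjectsPerIteration(totalObjects, foldsCount) >= requiredMinimum;
    }
}
```

Under these assumptions, 10 objects split into 5 folds leave 8 objects for training on each iteration, which meets the minimum for the combined classifier. With 6 objects and 3 folds, only 4 remain, which is enough for the text classifier but not for the combined one.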

Now you are ready to train a model. Pass the TrainingData object you configured in Step 4 to the TrainModel method of the Trainer object. It returns a TrainingResults collection, which currently contains only one TrainingResult. If you chose to perform cross-validation, review the performance scores in its ValidationResult subobject.
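If you turn validation off and calculate the performance scores on your side, the usual definitions of precision, recall, and F1 can be sketched as below for a single category. The counts of true positives, false positives, and false negatives are assumed to come from your own comparison of predicted and expected labels. `ClassificationScores` is a hypothetical helper, and whether the Engine's own scores are computed in exactly this way is an assumption.

```csharp
using System;

public static class ClassificationScores
{
    // Precision: the share of objects assigned to the category that really belong to it
    public static double Precision(int truePositives, int falsePositives)
    {
        int selected = truePositives + falsePositives;
        return selected == 0 ? 0.0 : (double)truePositives / selected;
    }

    // Recall: the share of objects belonging to the category that were assigned to it
    public static double Recall(int truePositives, int falseNegatives)
    {
        int relevant = truePositives + falseNegatives;
        return relevant == 0 ? 0.0 : (double)truePositives / relevant;
    }

    // F1: the harmonic mean of precision and recall
    public static double F1(double precision, double recall)
    {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }
}
```

For example, with 6 true positives, 2 false positives, and 6 false negatives, precision is 0.75 and recall is 0.5, which gives an F1 score of about 0.6.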

Note: Model training and classification will be performed in sequential mode, regardless of the IMultiProcessingParams::MultiProcessingMode value.

The ITrainingResult::Model property provides access to the trained classification model. You may save it into a file with the help of the SaveToFile method or use it directly to classify some documents (proceed to Step 6).

C#

// Create the trainer object and set up parameters
FREngine.ITrainer trainer = classEngine.CreateTrainer();
trainer.TrainingParams.ClassifierType = (int)FREngine.ClassifierTypeEnum.CT_Image; // the classifier will only use the image features
// We will leave the other settings default and train the model directly
FREngine.ITrainingResults results = trainer.TrainModel ( trainingData );
// Check the model's F1 score
double F1 = results[0].ValidationResult.FMeasure;
// Retrieve the classification model
FREngine.IModel model = results[0].Model;
// Save the model for later use
model.SaveToFile( "C:\\model.dat" );

Step 6. Classifying documents

Step 7. Unloading ABBYY FineReader Engine

After finishing your work with ABBYY FineReader Engine, you need to unload the Engine object. To do this, use the DeinitializeEngine exported function.

C#

public class EngineLoader : IDisposable
{
    // Unload FineReader Engine
    public void Dispose()
    {
        if (engine == null)
        {
            // Engine was not loaded
            return;
        }
        engine = null;
        // Deleting all objects before FreeLibrary call
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
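        // Keep the DeinitializeEngine result so that it is not overwritten below
        int deinitializeResult = deinitializeEngine();
        // Free the library only if it reports that it can be unloaded (S_OK)
        int canUnloadResult = dllCanUnloadNow();
        if (canUnloadResult == 0)
        {
            FreeLibrary(dllHandle);
        }
        dllHandle = IntPtr.Zero;
        initializeEngine = null;
        deinitializeEngine = null;
        dllCanUnloadNow = null;
        // Throw after cleaning up if either call returned a failure code
        Marshal.ThrowExceptionForHR(deinitializeResult);
        Marshal.ThrowExceptionForHR(canUnloadResult);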
    }
    // Kernel32.dll functions
    [DllImport("kernel32.dll")]
    private static extern IntPtr LoadLibraryEx(string dllToLoad, IntPtr reserved, uint flags);
    private const uint LOAD_WITH_ALTERED_SEARCH_PATH = 0x00000008;
    [DllImport("kernel32.dll")]
    private static extern IntPtr GetProcAddress(IntPtr hModule, string procedureName);
    [DllImport("kernel32.dll")]
    private static extern bool FreeLibrary(IntPtr hModule);
    // FREngine.dll functions
    [UnmanagedFunctionPointer(CallingConvention.StdCall, CharSet = CharSet.Unicode)]
    private delegate int InitializeEngine(string customerProjectId, string licensePath, 
        string licensePassword, string tempFolder, string dataFolder, bool isSharedCPUCoresMode, 
        ref FREngine.IEngine engine);
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DeinitializeEngine();
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DllCanUnloadNow();
    // private variables
    private FREngine.IEngine engine = null;
    // Handle to FREngine.dll
    private IntPtr dllHandle = IntPtr.Zero;
    private InitializeEngine initializeEngine = null;
    private DeinitializeEngine deinitializeEngine = null;
    private DllCanUnloadNow dllCanUnloadNow = null;
}

Required resources

You can use the FREngineDistribution.csv file to automatically create a list of files required for your application to function. For processing with this scenario, select the following values in column 5 (RequiredByModule):

Core

Core.Resources

Opening

Opening, Processing

Processing

Processing.Classification

Processing.Classification.NaturalLanguages

Processing.OCR

Processing.OCR, Processing.ICR

Processing.OCR.NaturalLanguages

Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages

If you modify the standard scenario, change the required modules accordingly. You also need to specify the interface languages, recognition languages, and any additional features which your application uses (such as Opening.PDF if you need to open PDF files, or Processing.OCR.CJK if you need to recognize texts in CJK languages). See Working with the FREngineDistribution.csv File for further details.

Additional optimization

You can find more information about setting up the various processing stages in these articles:

Loading Engine

Different Ways to Load the Engine Object
Describes the ways of loading the Engine object in detail.
Using ABBYY FineReader Engine in Multi-Threaded Server Applications
Discusses the specifics of using FineReader Engine in server applications.

Opening and preprocessing images

Image Preprocessing
Describes a scenario where ABBYY FineReader Engine is used to preprocess images.

Recognition

Parallel Processing with ABBYY FineReader Engine
To quickly prepare the recognized documents or pages for a classifier with text features, use parallel processing for recognition and then turn multiprocessing off for classification.
