NLP model training based on feedback from verification operators

The quality of data extraction can be improved through additional training of NLP models by operators. If the program fails to detect certain fields or mistakes one field for another, the verification operator can indicate the correct field and retrain the NLP model. The program will then use the retrained model for more accurate data extraction.

Note: Additional training is not available for NLP models loaded into Document Definitions.

There are two ways to initiate the training of an NLP model during verification. You can:

  • Add a training stage after the verification stage. Training will start when the conditions specified for the training batch are met. For more information about setting up workflow stages, see Workflow setup.
  • Manually send documents to the training stage. To do this, right-click the document in the working batch and select Train on the shortcut menu.

Generally, the training procedure is as follows:

  • When training is initiated, ABBYY FlexiCapture automatically creates a generic training batch in the list of training batches (if it does not contain one already). All documents related to a specific Document Definition will be copied into this batch, regardless of their variant.
  • Each document is assigned either the For training or the For testing status.
  • Documents marked For training undergo training. As a result, a new NLP model is created.
  • The new model that has been created during training is then tested using documents marked For testing.
  • If the overall performance of the new model is not worse than that of the existing model, the existing model will be replaced with the new one. Otherwise, the new model will be rejected.
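The procedure above can be sketched in a few lines of Python. This is only an illustration of the split, train, test, and replace logic, not ABBYY FlexiCapture code: the function names, the random split, the single numeric score, and the idea of scoring both models on the same For testing documents are all assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_documents(docs: Sequence[str], training_share: float,
                    seed: int = 0) -> Tuple[List[str], List[str]]:
    """Give each document a 'For training' or 'For testing' status.

    A seeded random split is an assumption; the product may choose differently.
    """
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * training_share)
    return shuffled[:cut], shuffled[cut:]


def accept_new_model(new_score: float, old_score: float) -> bool:
    """Accept the new model if it is not worse than the existing one."""
    return new_score >= old_score


def run_training(docs: Sequence[str], training_share: float,
                 train: Callable[[List[str]], str],
                 evaluate: Callable[[str, List[str]], float],
                 current_model: str) -> str:
    """Return the model that stays in use after one training session."""
    for_training, for_testing = split_documents(docs, training_share)
    candidate = train(for_training)                  # hypothetical training hook
    new_score = evaluate(candidate, for_testing)     # hypothetical scoring hook
    old_score = evaluate(current_model, for_testing)
    return candidate if accept_new_model(new_score, old_score) else current_model
```

Note that a tie counts as acceptance here, which matches the "not worse than" wording of the last step.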

During document processing, you may find that field locations differ significantly between some documents even though the documents contain identical sets of fields. To improve extraction quality for such documents, create a separate training batch for each document variant.

 

Creating a field extraction training batch for a specific vendor or variant

To train an NLP model on documents that come from a specific vendor or belong to a particular variant, create a separate training batch. Proceed as follows:

  1. On the Project Setup Station, open the project with the NLP model. For more information about setting up an NLP model, see Creating NLP models.
  2. Navigate to Field Extraction Training Batches by selecting Fields Training > Open Field Extraction Training Batches. Alternatively, you can either use the Ctrl + Alt + B key combination, or select Field Extraction Training Batches on the shortcut menu.
  3. Create a new batch by selecting File > New Batch. Alternatively, you can use the Ctrl + N key combination. Choose the appropriate Document Definition and variant and then select the NLP Batch option on the shortcut menu.
  4. Add your documents, recognize them, edit the order of sections, and start the training by selecting Train on the shortcut menu. Alternatively, you can either use the Ctrl + F7 key combination or click the Train Batch button on the toolbar.

The quality of a trained NLP model depends on the number of documents in the training batch and the quality of their markup. Please note the following:

  • All the fields described by the Document Definition should be marked up in the training documents.
  • Each training batch should contain between 100 and 500 documents. This is enough for the program to select the best parameters for your NLP model without slowing down the training process.

When operator feedback is used for training, new documents are added to both the generic training batch and the training batch of the document's variant. The NLP model used for data extraction is then chosen as follows:

  • For a variant with an existing training batch, the NLP model created for that particular batch will be used.
  • For all other variants, the NLP model created for the generic training batch will be used.
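The fallback rule in the two points above amounts to a lookup with a default. A minimal sketch, assuming models are kept in a dictionary keyed by variant name and that the generic batch's model is stored under a made-up GENERIC key (neither detail comes from the product):

```python
from typing import Dict, Optional

GENERIC: Optional[str] = None  # hypothetical key for the generic training batch


def select_model(models_by_variant: Dict[Optional[str], str],
                 variant: Optional[str]) -> str:
    """Use the variant's own model if it has a training batch, else the generic one."""
    if variant in models_by_variant:
        return models_by_variant[variant]
    return models_by_variant[GENERIC]


models = {GENERIC: "generic-model", "Vendor A": "vendor-a-model"}
print(select_model(models, "Vendor A"))  # the variant has its own training batch
print(select_model(models, "Vendor B"))  # no batch for this variant: generic model
```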

If a document identical to one already present in a training batch is added to it from the same source, the new document will replace the older one. This will also be recorded in the background task log for the training task. The program uses the document registration parameters to decide whether a document is a copy of an already existing document or not.
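One way to picture this replacement rule is a batch indexed by source and registration parameters. The sketch below is an assumption-heavy illustration: the key layout, the function name, and the log wording are invented for the example, and the actual comparison performed by the program is not documented here.

```python
from typing import Dict, List, Tuple

# Hypothetical identity of a document: (source, registration parameters).
RegistrationKey = Tuple[str, str]


def add_document(batch: Dict[RegistrationKey, str], log: List[str],
                 source: str, registration: str, doc_id: str) -> None:
    """Add a document; a copy from the same source replaces the older one."""
    key = (source, registration)
    if key in batch:
        # Stand-in for the entry written to the background task log.
        log.append(f"Replaced {batch[key]} with {doc_id}")
    batch[key] = doc_id


batch: Dict[RegistrationKey, str] = {}
log: List[str] = []
add_document(batch, log, "scanner-1", "invoice-0042", "doc-1")
add_document(batch, log, "scanner-1", "invoice-0042", "doc-2")  # replaces doc-1
add_document(batch, log, "scanner-2", "invoice-0042", "doc-3")  # other source: kept
```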

After creating the batch, you can specify additional options. To do this, select the Show NLP Batch Settings... command.

The following additional options can be specified in the Training Batch Settings dialog box:

  • Maximum documents in each training batch
    Once the maximum number of documents is reached, any new documents added to the training batch will replace old documents.
  • Maximum percentage of replaced documents
    Indicates the percentage of old documents that can be replaced with new ones during one training session. Documents that have been sent to the training stage but were not included in the batch will not be used to train the new NLP model.
  • Start training if batch contains more than __ new documents or more than __ % of new documents
    Training starts when at least one of the following is true: the number of new documents added to the training batch is greater than the specified value, or the percentage of new documents relative to the total number of documents in the batch is equal to or greater than the specified value. Otherwise, training does not start, and an entry is added to the background task log stating that there are not enough new documents to start training.
  • Percentage of documents to be used for training
    Specifies what percentage of the documents in the batch is marked For training; the remaining documents are marked For testing. For example, if you set the percentage of For training documents to 70%, the remaining 30% will be marked For testing.
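The start condition and the document limit described above can be expressed as a short sketch. The function and parameter names are hypothetical. Dropping the oldest document first is an assumption, and the Maximum percentage of replaced documents setting is deliberately not modeled.

```python
from collections import deque
from typing import Deque


def should_start_training(new_docs: int, total_docs: int,
                          min_new_docs: int, min_new_percent: float) -> bool:
    """True if the count of new documents exceeds the count threshold,
    or their share of the batch reaches the percentage threshold."""
    if total_docs == 0:
        return False
    new_percent = 100.0 * new_docs / total_docs
    return new_docs > min_new_docs or new_percent >= min_new_percent


def make_capped_batch(max_documents: int) -> Deque[str]:
    """A batch that discards its oldest document once the limit is reached
    (oldest-first replacement is an assumption)."""
    return deque(maxlen=max_documents)


# With thresholds of 10 new documents or 20% new documents:
print(should_start_training(11, 200, 10, 20.0))  # count threshold exceeded
print(should_start_training(10, 200, 10, 20.0))  # neither condition holds
print(should_start_training(10, 50, 10, 20.0))   # 20% of the batch is new
```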

 

Training statistics

Once training is finished, statistics for an NLP model can be exported. This includes the following:

  • Information about the training batch settings.
  • Information about both the new and the old NLP models.
  • Training time.
  • The version of the NLP component used to train the NLP model.
  • Document and field training statistics.
  • Information about how recent the exported data is.
    If the isActual parameter is false, the batch was modified after the training and creation of a new NLP model: documents may have been added or removed, document markup might have changed, etc. For up-to-date statistics, training should be launched again.

To export the statistics for a training batch, right-click the batch, click Export Field Extraction Statistics... on the shortcut menu, and specify where you want to save the CSV file.

01.12.2020 7:03:59

