NLP model training based on feedback from verification operators
The quality of data extraction can be improved through additional training of NLP models by operators. If the program fails to detect certain fields or mistakes one field for another, the verification operator can indicate the correct field and retrain the NLP model. The program will then use the retrained model for more accurate data extraction.
Note: Additional training is not available for NLP models loaded into Document Definitions.
There are two ways to initiate the training of an NLP model during verification. You can:
- Add a training stage after the verification stage. Training will start when the conditions specified for the training batch are met. For more information about setting up workflow stages, see Workflow setup.
- Manually send documents to the training stage. To do this, right-click the document in the working batch and select Train on the shortcut menu.
Generally, the training procedure is as follows:
- When training is initiated, ABBYY FlexiCapture automatically creates a generic training batch in the list of training batches (if it does not contain one already). All documents related to a specific Document Definition will be copied into this batch, regardless of their variant.
- Each document is assigned either the For training or the For testing status.
- Documents marked For training are used to train a new NLP model.
- The new model is then tested on the documents marked For testing.
- If the overall performance of the new model is not worse than that of the existing model, the existing model will be replaced with the new one. Otherwise, the new model will be rejected.
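The procedure above can be pictured with a short Python sketch. This is not ABBYY FlexiCapture code: the function names, the random split, and the `train_fn` and `score_fn` callbacks are assumptions made for illustration, and the program's actual way of assigning statuses and measuring performance is not described here.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_documents(docs: Sequence, training_share: float,
                    seed: int = 0) -> Tuple[List, List]:
    """Assign each document to the 'For training' or 'For testing' group.

    A seeded random split is an assumption; the product may assign
    statuses differently.
    """
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * training_share)
    return shuffled[:cut], shuffled[cut:]


def retrain(current_model, docs: Sequence,
            train_fn: Callable, score_fn: Callable,
            training_share: float = 0.7):
    """Train a candidate model and keep it only if it is not worse."""
    for_training, for_testing = split_documents(docs, training_share)
    candidate = train_fn(for_training)
    # The existing model is replaced when the new one is not worse,
    # so an equal score is enough for the candidate to win.
    if score_fn(candidate, for_testing) >= score_fn(current_model, for_testing):
        return candidate
    return current_model
```

Note that the comparison uses "not worse than", so a candidate that merely matches the existing model still replaces it.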
During document processing, it may be discovered that for some documents the field locations differ significantly, even though they contain identical sets of fields. In order to improve the recognition quality of such documents, create separate training batches for each document variant.
Creating a field extraction training batch for a specific vendor or variant
To train an NLP model on documents that come from a specific vendor or belong to a particular variant, create a new training batch. Proceed as follows:
- On the Project Setup Station, open the project with the NLP model. For more information about setting up an NLP model, see Creating NLP models.
- Navigate to Field Extraction Training Batches by selecting Fields Training > Open Field Extraction Training Batches. Alternatively, press Ctrl + Alt + B or select Field Extraction Training Batches on the shortcut menu.
- Create a new batch by selecting File > New Batch. Alternatively, press Ctrl + N. Choose the appropriate Document Definition and variant and then select the NLP Batch option on the shortcut menu.
- Add your documents, recognize them, edit the order of sections, and start the training by selecting Train on the shortcut menu. Alternatively, press Ctrl + F7 or click the Train Batch button on the toolbar.
The quality of a trained NLP model depends on the number of documents in the training batch and the quality of their markup. Please note the following:
- All the fields described by the Document Definition should be marked up in the training documents.
- It is recommended to have between 100 and 500 documents in each training batch. This number of documents will enable the program to select the best parameters for your NLP model without slowing down the training process.
When operator feedback is used for training, new documents will be added to both the generic training batch and the variant batch. The NLP model used for data extraction is then selected as follows:
- For a variant with an existing training batch, the NLP model created for that particular batch will be used.
- For all other variants, the NLP model created for the generic training batch will be used.
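The selection rule in the two items above amounts to a lookup with a fallback. The sketch below is only an illustration: `select_model`, the dictionary layout, and the use of `None` as the key for the generic training batch are hypothetical and not part of the product.

```python
from typing import Dict, Optional

# Hypothetical key under which the generic batch's model is stored.
GENERIC: Optional[str] = None


def select_model(models_by_variant: Dict[Optional[str], str],
                 variant: Optional[str]) -> str:
    """Return the variant's own model, or the generic one if none exists."""
    if variant in models_by_variant:
        return models_by_variant[variant]
    return models_by_variant[GENERIC]


models = {GENERIC: "generic-model", "VendorA": "vendor-a-model"}
print(select_model(models, "VendorA"))
print(select_model(models, "VendorB"))
```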
If a document identical to one already present in a training batch is added to it from the same source, the new document will replace the older one. This will also be recorded in the background task log for the training task. The program uses the document registration parameters to decide whether a document is a copy of an already existing document or not.
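The replacement behavior described above can be sketched as follows. The key made of the source and the registration parameters, the `TrainingDocument` class, and the log message text are all assumptions for illustration; the exact comparison the program performs is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TrainingDocument:
    source: str                                    # where the document came from
    registration_params: Tuple[Tuple[str, str], ...]
    markup_version: int                            # hypothetical payload field


def add_to_batch(batch: Dict[tuple, TrainingDocument],
                 doc: TrainingDocument, log: List[str]) -> None:
    """Add a document, replacing an older copy from the same source."""
    key = (doc.source, doc.registration_params)
    if key in batch:
        # Mirrors the entry written to the background task log.
        log.append(f"Replaced existing document from {doc.source}")
    batch[key] = doc
```

Two documents with the same registration parameters but different sources produce different keys, so neither replaces the other in this sketch.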
After creating the batch, you can specify additional options. To do this, select Show NLP Batch Settings....
The following additional options can be specified in the Training Batch Settings dialog box:
Limits the number of documents in a training batch. If the maximum number of documents is reached, any new documents added to the training batch will replace old documents.
Indicates the percentage of old documents that can be replaced with new ones during one training session. Documents that have been sent to the training stage but were not included in the batch will not be used to train the new NLP model.
- Start training if the batch contains more than __ new documents, or if new documents make up __ % or more of the batch
Training will start when at least one of the following is true: the number of new documents added into a training batch is greater than the specified value; the percentage of new documents relative to the total number of documents in a batch is equal to or greater than the specified value. Otherwise, training will not start, and an entry will be added into the background task log saying that there are not enough new documents to start training.
Specifies the percentages of documents marked For testing and For training. For example, if you limit the percentage of For training documents to 70%, the remaining 30% will be marked For testing.
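The start condition and the training/testing percentages can be expressed as two small functions. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and rounding the For training count down is a guess, since the rounding rule is not documented here.

```python
from typing import Tuple


def should_start_training(new_docs: int, total_docs: int,
                          min_new_docs: int, min_new_percent: float) -> bool:
    """Return True when at least one of the two start conditions holds."""
    if total_docs <= 0:
        return False
    # Condition 1: strictly more new documents than the specified value.
    more_than_count = new_docs > min_new_docs
    # Condition 2: share of new documents equal to or above the threshold.
    percent_new = 100.0 * new_docs / total_docs
    return more_than_count or percent_new >= min_new_percent


def split_counts(total_docs: int, training_percent: int) -> Tuple[int, int]:
    """Return (For training, For testing) counts; rounds down (assumption)."""
    for_training = total_docs * training_percent // 100
    return for_training, total_docs - for_training


print(should_start_training(10, 200, min_new_docs=10, min_new_percent=20))
print(split_counts(200, 70))
```

With 10 new documents out of 200 and thresholds of 10 documents and 20%, neither condition holds, so training would not start and only a log entry would be written.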
Once training is finished, statistics for an NLP model can be exported. This includes the following:
- Information about the training batch settings.
- Information about both the new and the old NLP models.
- Training time.
- The version of the NLP component used to train the NLP model.
- Document and field training statistics.
- Information about how recent the exported data is.
If the isActual parameter is false, the batch was modified after the training and creation of a new NLP model: documents may have been added or removed, document markup might have changed, etc. For up-to-date statistics, training should be launched again.
To export the statistics for a training batch, right-click the batch, click Export Field Extraction Statistics... on the shortcut menu, and specify where you want to save the CSV file.
1/16/2023 10:03:07 AM