NLP Model Training

ABBYY FlexiCapture 12 SDK uses NLP (Natural Language Processing) models to extract information from unstructured texts. NLP models trained on your sample documents determine what entities and segments should be extracted from a document, which allows for the information that you need to be extracted more efficiently.

You can create and train one or more NLP models depending on the variety of your documents, as well as the complexity and amount of the information that you need to extract. A segmentation model uses the text of the entire section to find the required paragraphs, while an extraction model finds entities in text fragments.

Preparing for training

NLP technology is used in ABBYY FlexiCapture SDK as a part of a Document Definition, so you need to create a new Document Definition or get it from the collection.

Get the Document Definition from the DocumentDefinitions collection of your Project. Switch the Document Definition to editing mode by calling the CheckOut method of the DocumentDefinitions object and get the sections of the Document Definition using the Sections property of the DocumentDefinition object.
Obtain the NLP models collection using the NlpModels property of the SectionDefinition object.
Get the collection of the extraction models using the ExtractionModels property of the NlpModels object or get the handler of the segmentation model using SegmentationModelHandler property of the NlpModels object. If you have an existing extraction model, you can load it using the LoadFromFile method of the ExtractionModels object.
Or you can create a new one:

To extract entities, you need an entity extraction model. Call the AddNew method of the ExtractionModels object.
If you want to improve the accuracy and speed of entity extraction using segmentation, create a segmentation model by calling AddNewSegmentationModel method of the SegmentationModelHandler object.

For both models:

Make sure the training flag (AllowTraining property) is set to TRUE if you want to train your NLP model.
Specify the field from which to extract the entity using the Source property.
Note: For a segmentation model, the value of the Source property is a specific Document Definition section to which the segmentation model belongs.
Specify the language of the text to be extracted using the Language property.
Specify the output fields using the ResultFields property. The output fields will contain information extracted by the model from your documents.

Call the CheckIn method of the DocumentDefinitions object.

C# code

Training NLP models

Once your NLP model has been loaded or created, you need to train it on some sample documents.

Add a new training batch using the AddNew method of the FieldsExtractionTrainingBatches object.
Set the value of the IsNlpTrainingBatch property of the FieldsExtractionTrainingBatch object to TRUE to define a training batch as a batch for NLP training.
Add images for NLP model training to the batch. Call the Recognize method of the Batch object to match the Document Definition and recognize all the documents in the batch.
Mark up the layout of the fields that will be trained. Assign the correct region to each field by calling the AddNew method of the FieldRegions object.
Train field extraction. Get the FieldsExtractionTrainer object from the training batch and call its Train method.

C# code

You can now use the trained NLP model to process your documents.

Important! To recognize documents using NLP technology, the NLP module must be installed. If no NLP components are found, you cannot use the NLP models.