Document Classification Training
Document classification assigns a document to a predefined category before extracting the data. This helps match Document Definitions faster and improve data extraction quality. For best classification results, use classification training.
Prepare for training
The batch for classification training should contain enough images of each document class for the classifier to learn the difference between the classes. In addition, you need to set up the mapping between classes and sections of the Document Definitions. This will allow the trained classifier to apply the correct Document Definition to each image.
- Create a new classification training batch using the AddNew method of the ClassificationTrainingBatches object.
- Add the images for classification training. Convert the training batch to a working batch using the AsBatch property of the ClassificationTrainingBatch object and then add images.
Important! Make sure that all the setup described in the following steps is done within the same Open/Close batch session. Once you close the working batch, the PageClass object is no longer available.
- Add new page classes to the PageClasses collection of the batch.
- Assign page classes and training states to the images. For each document in the ClassificationTrainingDocuments collection of the classification training batch:
  - Iterate through its ClassificationTrainingPages collection and assign the page classes you've just set up to the ReferenceClass property of each ClassificationTrainingPage object.
  - Set the TrainingState property of each ClassificationTrainingDocument to DTS_ForTraining or DTS_ForTesting. We recommend a 70:30 ratio of training to testing images for each class.
- Link each page class to a Document Definition section. In non-invoice projects, you can also link a page class to a section variant.
  - Get the section definition from the Document Definition using its Sections property.
  - Find the section variant you need in the "Variants" data set using its LookUp method.
  - Create links using the AddNew method of the LinksToSections collection of each PageClass object.
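The recommended 70:30 split per class can be planned before any TrainingState values are assigned. The following is a minimal sketch, not SDK code: the SplitState enum and the TrainingSplit helper are hypothetical names introduced for illustration, and the input is assumed to be a list holding the class name of each training document. Mapping the result onto DTS_ForTraining and DTS_ForTesting is left to the calling code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the two training states named in the text
// (DTS_ForTraining and DTS_ForTesting). Not part of the SDK.
public enum SplitState { ForTraining, ForTesting }

public static class TrainingSplit
{
    // Returns one state per document (same order as classNames) so that,
    // within each class, about trainingShare of the documents are marked
    // for training and the rest for testing.
    public static SplitState[] Assign(IReadOnlyList<string> classNames, double trainingShare = 0.7)
    {
        var states = new SplitState[classNames.Count];

        // Group document indices by class so the ratio holds per class,
        // not only across the whole batch.
        var indicesByClass = classNames
            .Select((name, index) => (name, index))
            .GroupBy(item => item.name, item => item.index);

        foreach (var group in indicesByClass)
        {
            List<int> indices = group.ToList();

            // Round to the nearest whole document and keep at least one
            // document of every class for training.
            int trainingCount = Math.Max(
                1,
                (int)Math.Round(indices.Count * trainingShare, MidpointRounding.AwayFromZero));

            for (int i = 0; i < indices.Count; i++)
            {
                states[indices[i]] = i < trainingCount
                    ? SplitState.ForTraining
                    : SplitState.ForTesting;
            }
        }

        return states;
    }
}
```

The helper takes the first documents of each class for training, so shuffle the input beforehand if the documents are ordered in a way that could bias the split. A class with a single document ends up with no testing documents, which is a sign that the batch needs more images of that class.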
Train the classifier
During training, the classifier attempts to learn the differences between the images that belong to different classes. Since training may decrease classification quality in some cases, it is up to you to decide whether the classifier should be updated when training is completed. You can add more documents to the training batch, run training again, examine the validation results, and update the classifier when needed.
- Get the classification trainer object using the ClassificationTrainer property of the ClassificationTrainingBatch object.
- Call the CheckOut method of the ClassificationTrainer object to switch the classifier to training mode. Set up classification trainer parameters using its ClassificationTrainingParams property.
- Perform training by calling the Train method.
- Examine the training results. The TrainingResults object returned by the Train method exposes the CandidateResults and CurrentResults properties. The CandidateResults property refers to a ValidationResults object with statistics for the newly trained classifier. The CurrentResults property contains the statistics for the previously used classifier.
- Compare the statistics and decide whether you want to use the newly trained classifier. You can also consult the ShouldApply property of the TrainingResults object. This boolean property is an overall estimate of whether the new classifier is "better" than the old one. Depending on your decision, call one of the two methods:
  - The CheckIn method of the ClassificationTrainer object: if you choose to apply the new classifier, this method publishes the classifier and switches off the training mode.
  - The UndoCheckout method of the ClassificationTrainer object: if you prefer the previously used classifier, this method discards the changes and switches off the training mode.
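The decision in the last step can be kept in one small, testable function. The sketch below shows one possible conservative policy and is not SDK code: TrainerAction and ClassifierDecision are hypothetical names, and the two accuracy values stand for whichever quality statistic you choose to read from the candidate and current ValidationResults objects (the specific statistic is an assumption, not something this page defines).

```csharp
using System;

// Hypothetical names for the two possible outcomes described above.
public enum TrainerAction { CheckIn, UndoCheckout }

public static class ClassifierDecision
{
    // candidateAccuracy and currentAccuracy are assumed to be comparable
    // quality figures (higher is better) taken from the candidate and
    // current validation statistics. shouldApply is the value of the
    // ShouldApply property. minGain is the smallest improvement that
    // justifies publishing the new classifier.
    public static TrainerAction Choose(
        double candidateAccuracy,
        double currentAccuracy,
        bool shouldApply,
        double minGain = 0.0)
    {
        bool improved = candidateAccuracy - currentAccuracy >= minGain;

        // Publish only when both signals agree; otherwise keep the
        // previously used classifier.
        return shouldApply && improved
            ? TrainerAction.CheckIn
            : TrainerAction.UndoCheckout;
    }
}
```

The calling code would then invoke CheckIn or UndoCheckout on the ClassificationTrainer object according to the returned value. Requiring both signals to agree is a deliberate design choice that favors the existing classifier when the evidence is mixed; relax it if ShouldApply alone is sufficient for your project.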
Use the classifier
Document recognition uses the classifier automatically to match the Document Definitions faster. To enable this behavior, configure the working batch to use the classifier from the appropriate classification training batch.
- Create a new batch type and link it to the classifier:
  - Create a BatchTypeParams object using the CreateBatchTypeParams method of the Project object.
  - Set the UseClassifier property of the BatchTypeParams object to TRUE, so that all batches of this type will use the classifier.
  - Link the BatchTypeParams object to the classification training batch by setting its ClassificationBatch property.
  Note: You can also specify text recognition parameters using the FullTextRecognitionParams property.
  - Create a batch type: call the AddNew method of the BatchTypes object and pass the BatchTypeParams you have just configured.
- Create a working batch using the AddNewEx method of the Batches collection and specify the batch type you have just created. Add the images to the working batch.
- Process the images in the batch using the Recognize method. During recognition, the Document Definitions will be matched with the help of the classifier.
Note: To check which Document Definition has been matched to a document, you can use the DocumentDefinition property of the Document object.
Samples
See the Classification code sample for an implementation of this scenario.
15.08.2023 13:19:30