Recognizing Invoices with ABBYY FlexiCapture SDK
As flexible documents, invoices have fields whose design, number, and layout may vary significantly from document to document. You can use ABBYY FlexiCapture SDK to extract information from invoices.
Using a preconfigured FlexiCapture invoice project
ABBYY FlexiCapture SDK provides several preconfigured FlexiCapture invoice projects containing Document Definitions optimized for invoices from different countries. You can find the sample projects in SampleInvoiceProjects in the Samples folder (Start > Programs > ABBYY FlexiCapture 12 SDK > Code Samples Folder). Alternatively, you can create an empty project with predefined countries and a Document Definition configured for one of the countries:
- Australia or New Zealand
- Canada
- European Union
- Japan
- Spain
- United States of America
You can also use a FlexiCapture invoice project created and set up in FlexiCapture Developer's Package. In this case, see help.abbyy.com for reference.
Setting up a new project
Invoice projects are set up the same way as ordinary FlexiCapture projects. However, you may want to configure some additional settings to get the best recognition quality.
- Create an empty invoice project using the CreateInvoiceProject method or open an existing one using the OpenProject method of the Engine object (see also Opening the Project).
- Load the business units and vendors data sets:
  - Get the DocumentDefinition from the DocumentDefinitions collection of your Project. Switch the Document Definition to editing mode by calling the CheckOut method of the DocumentDefinitions object.
  - Use the Find method of the DataSets object to retrieve the "Vendors" and "BusinessUnits" data sets.
  - Populate the data sets with the information about your vendors and business units. Use either the AddNew method and the Field property of the DataSetTableRecords object to add records one by one, or fill the data set in a single operation with the UpdateFromDB method.
    Note: Using UpdateFromDB requires a configured connection to your external database. You can set up the connection in FlexiCapture Developer's Package.
  - Save the changes you made to the data sets by calling the Commit method of the DataSetTableRecords object. Call the CheckIn method of the DocumentDefinitions object.
Note: Since Release 1 Update 4, some information about vendors and business units will be detected even if no database is connected. We use machine learning to detect company names and addresses and recognize tax IDs and IBANs because of their distinctive formats.
- You may also wish to specify additional options for best results.
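The order of calls in the data set step can be sketched as follows. This is a minimal sketch that does not use the SDK: DataSetRecordsStub, DocumentDefinitionStub, and VendorLoader are hypothetical in-memory stand-ins, the field names "Id" and "Name" are placeholders, and the stub merges CheckOut, CheckIn, and Find into one class for brevity. Check the actual signatures against the SDK reference before adapting it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-ins, NOT the SDK's types. They model only
// the calls named in the steps above so that the call order can be run.
public sealed class DataSetRecordsStub
{
    private readonly List<Dictionary<string, string>> pending =
        new List<Dictionary<string, string>>();

    public List<Dictionary<string, string>> Committed { get; } =
        new List<Dictionary<string, string>>();

    // Assumed shape: AddNew starts an empty record whose fields are set by name.
    public Dictionary<string, string> AddNew()
    {
        var record = new Dictionary<string, string>();
        pending.Add(record);
        return record;
    }

    // In this stub, pending records are stored only when Commit is called.
    public void Commit()
    {
        Committed.AddRange(pending);
        pending.Clear();
    }
}

public sealed class DocumentDefinitionStub
{
    private readonly Dictionary<string, DataSetRecordsStub> dataSets =
        new Dictionary<string, DataSetRecordsStub>
        {
            { "Vendors", new DataSetRecordsStub() },
            { "BusinessUnits", new DataSetRecordsStub() }
        };

    public bool IsCheckedOut { get; private set; }

    public void CheckOut() { IsCheckedOut = true; }

    public void CheckIn() { IsCheckedOut = false; }

    // The stub enforces the documented order. Whether the SDK itself rejects
    // out-of-order calls is not stated in the text.
    public DataSetRecordsStub Find(string name)
    {
        if (!IsCheckedOut)
            throw new InvalidOperationException("Check out the Document Definition first.");
        return dataSets[name];
    }
}

public static class VendorLoader
{
    // Follows the documented order: CheckOut, Find, AddNew, Commit, CheckIn.
    // Returns the number of committed vendor records.
    public static int LoadVendors(
        DocumentDefinitionStub definition,
        IEnumerable<KeyValuePair<string, string>> vendors)
    {
        definition.CheckOut();
        try
        {
            DataSetRecordsStub records = definition.Find("Vendors");
            foreach (KeyValuePair<string, string> vendor in vendors)
            {
                Dictionary<string, string> record = records.AddNew();
                record["Id"] = vendor.Key;     // placeholder field name
                record["Name"] = vendor.Value; // placeholder field name
            }
            records.Commit();
            return records.Committed.Count;
        }
        finally
        {
            // Always leave editing mode, even if populating the data set failed.
            definition.CheckIn();
        }
    }
}
```

Wrapping the edit in try/finally is a design choice of this sketch: it keeps the Document Definition from staying in editing mode when a record cannot be added.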
Extracting data from invoices
- Create a batch and add images to it.
- Recognize the images. During recognition, the program will apply the Document Definition and extract the field values.
- Work with the extracted data. You can go through all the fields that were found and retrieve their values (see Working with Recognized Data). If you find that a field was detected incorrectly, you can set the correct region and train the recognizer to extract this field more accurately in the future (see Fields Extraction Training).
- Export the results.
Processing invoices without data sets
The process of capturing invoices can be carried out without "Vendors" and "BusinessUnits" data sets. In this case, you need to train the fields using the clustering feature.
- Create an empty invoice project using the CreateInvoiceProject method or open a project without data sets using the OpenProject method of the Engine object.
- Enable the clustering feature using the InvoiceFeatures property of the InvoiceSettings object.
- Add a new training batch using the AddNew method of the FieldsExtractionTrainingBatches object. Pass NULL as the value of the Record parameter instead of the vendor record.
- Add some documents for training. On each page, find the fields that you are going to train, for example, the Total field. Note that you cannot train the vendor and business unit fields.
- Get the FieldsExtractionTrainer object using the FieldsExtractionTrainer property of the FieldsExtractionTrainingBatch object.
- Train fields extraction by calling the Train method of the trainer object.
- Now you can recognize documents in the working batch that are similar to those used in training by calling the Recognize method of the Batch object. The fields you have trained should be detected and extracted more precisely.
Additional options
Countries and languages
A large list of predefined countries and languages is available for invoice recognition, but if you do not find the country you need in the list, you can add it. Use the ICountries::AddNew method, then specify the country name and ISO alpha-2 code in the Name and Alpha2Code properties of the new Country object. Add the languages used in invoices originating from this country with the AddRecognitionLanguage method. If a language is not already present in the InvoiceLanguages collection, it will be added there as well, and you will be able to edit its invoice-related properties, such as keyword lists. You can also set a default country from the list of predefined countries, to be used when the country is not detected automatically. To do this, use the DefaultCountry property of the InvoiceSettings object.
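The behavior described above, where adding a recognition language to a country also registers it in the shared invoice language collection, can be modeled as follows. This is a minimal sketch that does not use the SDK: CountryStub, InvoiceSettingsStub, and CountrySetup are hypothetical stand-ins, and identifying languages by plain name strings is an assumption. Norway is used purely as an example and may already be in the predefined list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins, NOT the SDK's types. Languages are identified by
// plain strings here, which is an assumption of this sketch.
public sealed class CountryStub
{
    private readonly HashSet<string> sharedInvoiceLanguages;

    public string Name { get; set; } = "";
    public string Alpha2Code { get; set; } = "";
    public List<string> RecognitionLanguages { get; } = new List<string>();

    public CountryStub(HashSet<string> sharedInvoiceLanguages)
    {
        this.sharedInvoiceLanguages = sharedInvoiceLanguages;
    }

    public void AddRecognitionLanguage(string language)
    {
        if (!RecognitionLanguages.Contains(language))
            RecognitionLanguages.Add(language);

        // HashSet.Add does nothing when the language is already registered.
        sharedInvoiceLanguages.Add(language);
    }
}

public sealed class InvoiceSettingsStub
{
    // Starts with one language to show that existing entries are not duplicated.
    public HashSet<string> InvoiceLanguages { get; } = new HashSet<string> { "English" };

    public List<CountryStub> Countries { get; } = new List<CountryStub>();

    public CountryStub AddNewCountry()
    {
        var country = new CountryStub(InvoiceLanguages);
        Countries.Add(country);
        return country;
    }
}

public static class CountrySetup
{
    // Adds a country, sets its name and alpha-2 code, then attaches its languages.
    public static CountryStub AddCountry(
        InvoiceSettingsStub settings, string name, string alpha2Code, params string[] languages)
    {
        if (alpha2Code == null || alpha2Code.Length != 2)
            throw new ArgumentException("An ISO alpha-2 code has exactly two letters.");

        CountryStub country = settings.AddNewCountry();
        country.Name = name;
        country.Alpha2Code = alpha2Code.ToUpperInvariant();
        foreach (string language in languages)
            country.AddRecognitionLanguage(language);
        return country;
    }
}
```

For example, CountrySetup.AddCountry(settings, "Norway", "NO", "Norwegian", "English") leaves the stub's InvoiceLanguages set with two entries, because "English" was already registered.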
Keywords
Every predefined country and language for invoice recognition contains keywords for specific fields. Once a keyword is recognized on the image, the processing engine searches for the corresponding field nearby. If the keywords found on your invoices are not listed, we recommend adding them to improve the quality of field extraction. Country-specific keywords are set through various properties of the Country object, while language-specific keywords are accessed via the KeywordLists property of the InvoiceLanguage object. The set of keyword lists is predefined, as the lists are tied to the available invoice fields.
Additional fields
Some of the additional fields that can be recognized on an invoice are not turned on by default: for example, delivery date. You can manage these with the help of the InvoiceFeatures object.
Line items
In addition to capturing data from the main fields of an invoice (number, date, vendor, business unit, etc.), you may need to capture data from repeatable fields, i.e. fields that occur several times in the invoice. In ABBYY FlexiCapture SDK, a group of repeatable fields of an invoice is represented by line items. To access the recognized data of the line items:
- Access the instances of the field using the Instances property of the corresponding Field object.
- Get child fields of the instance using the Children property of the Field object.
- Get the value of the required field using the Value property of the Field object.
Note: To work with the line items, you must turn on the Line item extraction feature with the help of the InvoiceFeatures object.
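The three access steps above amount to a nested traversal: instances of the repeating field, then the child fields of each instance, then each child's value. The following minimal sketch shows that traversal on a hypothetical stand-in class, FieldStub, which models only the Instances, Children, and Value members named in the text. It is not the SDK's Field object, and representing values as strings and collections as lists are assumptions of this sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the SDK's Field object. Only the members named
// in the steps above (Instances, Children, Value) are modeled.
public sealed class FieldStub
{
    public string Name { get; }
    public string Value { get; }
    public List<FieldStub> Instances { get; } = new List<FieldStub>();
    public List<FieldStub> Children { get; } = new List<FieldStub>();

    public FieldStub(string name, string value = "")
    {
        Name = name;
        Value = value;
    }
}

public static class LineItemReader
{
    // Returns one dictionary per line item, mapping child field name to value.
    public static List<Dictionary<string, string>> Read(FieldStub lineItems)
    {
        var rows = new List<Dictionary<string, string>>();
        foreach (FieldStub instance in lineItems.Instances)  // step 1: instances
        {
            var row = new Dictionary<string, string>();
            foreach (FieldStub child in instance.Children)   // step 2: child fields
            {
                row[child.Name] = child.Value;               // step 3: value
            }
            rows.Add(row);
        }
        return rows;
    }
}
```

Collecting each line item into a dictionary keyed by field name keeps the result independent of the order in which child fields appear, which is convenient when exporting rows.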
During line item verification, use the ContinueLineItems method of the Document object when many line items were extracted incorrectly. Have the operator mark up the first line item manually, then call this method to detect the other line items using the same pattern. Note that this is only possible if the country of the vendor or purchaser was detected.
15.08.2023 13:19:30