Detecting the main fields
This article describes how the main fields of an invoice are detected and captured.
The program starts processing an invoice by recognizing its text in accordance with the Document Definition settings:
- Recognition mode (Fast / Balanced / Normal / Accurate) determines the speed of recognition and the quality of the resulting text layer. To specify a recognition mode, in the Document Definition Editor, click Document Definition → Document Definition Properties... → Recognition.
- Recognition languages are the languages to be used for recognition. To specify recognition languages, in the Document Definition Editor, click Document Definition → Document Definition Properties... → Document Definition Settings, and then click Edit in the Countries and Languages group to select the required languages.
Note: Recognition languages in FlexiCapture for Invoices are tied to the country settings. When adding an invoice country to the Countries and Languages group, the corresponding languages will automatically appear in the Document Definition settings. Invoice fields are extracted upon recognition.
To detect and capture fields on an invoice, the program can use a FlexiLayout and neural networks.
Both methods are described below, together with the algorithm that either combines the results obtained by using both these methods or selects the best result.
Business unit and vendor
The following may be used to determine the Vendor and Business Unit:
- Document Definition settings: IBAN, VATID, and NationalVATID formats, as well as the corresponding keywords;
- Data set record fields: IBAN, VATID, NationalVATID, Name, Street, City, ZIP.
Automatic company detection algorithm
The level of detail and the quality of the information in the Data set columns have a significant impact on detection quality. To ensure that the search results are as accurate as possible, make sure that:
- The unique company identifiers are filled in
Filling in the unique value columns (VATID, NationalVATID, IBAN) will significantly improve the probability of correct detection, since each of these values identifies exactly one company.
- There are no repeating company records
The absence of any repeating records will improve the probability of correctly detecting the company. For more information about eliminating duplicate records, see Eliminating duplicate records in the external database.
- There are no unrelated records
The presence of outdated or invalid records in the Data set may cause the company to be detected incorrectly because of coincidental similarities between various field values.
- All fields are filled in for every company record
Specify as much information about companies as possible. The more fields are filled in the Data set, the higher the probability of correctly detecting the company.
- Multiple-value columns are used to store the same information that is denoted in different ways, and not different information altogether
For example, if a single company has several addresses, there must be a separate record for each of them, even if all other fields contain the same information. For more information, see Preparing vendor and business unit databases.
The automatic vendor and business unit detection algorithm consists of the following steps:
- Unique identifier search
The following fields are considered to be unique company identifiers: VATID, NationalVATID, and IBAN.
ABBYY FlexiCapture for Invoices searches the document image for the values listed above. In the document definition properties (Document Definition Settings tab, Countries and Languages group), the VATID, NationalVATID, and IBAN (Formats tab) formats, as well as keywords (Keywords tab) are set for each country using regular expressions.
Note: Correctly filled in keywords and identifier formats significantly improve detection quality.
The program looks for exact matches on the image for such fields. Regular expressions can also take possible recognition errors into account. This is done by means of extended regular expressions (see Extended regular expressions).
Note: ABBYY FlexiCapture for Invoices offers preset regular expressions, but you can create your own if required. To do so, navigate to the Countries and Languages group in the Document Definition Settings tab, select the appropriate country, and click Edit….
Detected values are normalized as follows:
- letters are changed to upper case,
- spaces and the following characters are removed: " . ", " , ", " — ", " / ", " \ ".
If the letter prefix of a field is specified using a regular expression in the country properties in the Formats tab, the recognized prefix is replaced by the primary prefix (set in the country properties in the Formats tab).
For example, the identifier "DE12345" may be recognized as "OE12345". The detected prefix OE will then be replaced with the correct prefix DE.
The VATID, NationalVATID, and IBAN fields detected on a document image will be used to query the Data set. The VATID, NationalVATID, and IBAN column values received from the Data set are normalized in the same manner as the values detected on the image, after which they are compared (using exact matching) with the normalized values of the fields detected on the image.
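The normalization and exact matching described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the product's implementation: the function names, the PREFIX_VARIANTS table, and the record layout are hypothetical, and the dash in the list of removed characters is assumed to cover the hyphen as well as longer dashes.

```python
import re

# Characters removed during normalization: whitespace, period, comma,
# dashes, slash, and backslash.
# Assumption: "dash" covers the hyphen, the en dash, and the em dash.
_REMOVED = re.compile(r"[\s.,/\\\-\u2013\u2014]")

# Hypothetical table: primary prefix -> pattern of prefixes as they may be
# recognized. "DE" may be misread as "OE", as in the example above.
PREFIX_VARIANTS = {"DE": re.compile(r"^[DO]E")}


def normalize_identifier(value: str) -> str:
    """Upper-case the value, strip separators, and restore the primary prefix."""
    text = _REMOVED.sub("", value.upper())
    for primary, pattern in PREFIX_VARIANTS.items():
        if pattern.match(text):
            return pattern.sub(primary, text, count=1)
    return text


def find_matching_records(detected: str, records: list, column: str) -> list:
    """Return the records whose normalized column value equals the detected value."""
    key = normalize_identifier(detected)
    return [
        record
        for record in records
        if record.get(column) and normalize_identifier(record[column]) == key
    ]
```

For example, find_matching_records("OE 123.45", records, "VATID") would return a record whose VATID column holds "DE12345", because both values normalize to the same string.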
- Company name and address search
A query containing all of the document text is sent to the Data set to find the records that match it most closely.
The Name, Street, ZIP, and City values detected on the image are matched to the corresponding Data set record values.
Note: To get the best possible name and company search results, make sure that the corresponding Data set columns are filled in. Company name and address information is especially important in cases where the company cannot be identified using VATID, NationalVATID, or IBAN.
- Hypothesis formation
Companies found during steps 1 and 2 are used to form a set of hypotheses. ABBYY FlexiCapture for Invoices evaluates these hypotheses and then selects the 5 vendor records and the 5 business unit records that most reliably match the field values on the document image. These records are then used to form 25 pairs of vendors and business units, with each pair treated as a separate hypothesis. A neural network algorithm then rates the hypotheses by reliability, and the best-fit vendor-BU pair becomes the final hypothesis, that is, the result of vendor and business unit detection.
Note: If only the vendor database is connected, the quality of the vendor-BU pair evaluation may be negatively affected. It is recommended that a business unit database is connected even if business unit detection is not required. For more information, see Connecting databases.
Note: If there is a very small number of business units (e.g. 1), connecting such a database will not have a significant impact on the evaluation. However, doing so may improve the detection quality in cases where a business unit is being incorrectly detected as a vendor.
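The pairing step can be illustrated with a short Python sketch. It assumes that each candidate record already has a reliability score and that a pair rating function is available. Both callables are hypothetical stand-ins (the second one for the neural network rating), and the sketch does not model the case where only a vendor database is connected.

```python
from itertools import product
from typing import Callable

TOP_N = 5  # the text above states that 5 vendor and 5 business unit records are kept


def best_pair(vendors, business_units, record_score: Callable, pair_score: Callable):
    """Keep the top 5 records of each kind, pair them, and return the best-rated pair."""
    top_vendors = sorted(vendors, key=record_score, reverse=True)[:TOP_N]
    top_units = sorted(business_units, key=record_score, reverse=True)[:TOP_N]
    pairs = list(product(top_vendors, top_units))  # at most 5 x 5 = 25 hypotheses
    if not pairs:
        return None
    # pair_score stands in for the neural network rating described above.
    return max(pairs, key=lambda pair: pair_score(*pair))
```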
Depending on how reliably a Data set record matches the field values on the document image, hypotheses are split into those:
- reliably matching the document image;
- unreliably matching the document image.
Depending on the verification scenario, you can decide whether or not to take hypothesis reliability into account when detecting the vendor and business unit. If you want ABBYY FlexiCapture for Invoices to select the final hypothesis exclusively from reliable hypotheses, you can filter them using the InvoiceReader/ShouldFilterUnsureCompanyHypotheses registry flag, which can be set to the following values:
- true — filtering is enabled, and the final hypothesis will be selected exclusively from the reliable hypotheses (default value);
- false — filtering is disabled, and the final hypothesis will be selected from all hypotheses regardless of their reliability;
Note that hypothesis filtering works differently for vendors and business units:
- When detecting vendors, no unreliable hypotheses for vendors will be considered. If there are no reliable hypotheses, a vendor will not be detected.
- When detecting business units:
  - if at least one reliable hypothesis has been found, no unreliable hypotheses will be considered;
  - if the set of hypotheses obtained during steps 1 through 3 does not contain at least one reliable hypothesis, the flag value will be ignored, and the final hypothesis will be selected from the unreliable hypotheses.
The above is due to the differences between vendor and business unit Data sets:
- There are usually far fewer business unit records than vendor records. They also change far less frequently, so it is easier to keep them up to date. Therefore, detecting a reliable hypothesis increases the probability of the final hypothesis being correct. However, detecting a business unit is important even if no reliable hypotheses have been found, since the most important factor in the reliability of the detection result is the reliability evaluation of the vendor-BU pairs.
- There are usually far more vendor records, and the Data set contains more columns, because vendors specify more information about their own company on their invoices than about the business unit. Records can also contain outdated information, so whether unreliable hypotheses should be filtered depends on both the quality of the Data set and the type of verification scenario.
Note: To improve the probability of detecting reliable hypotheses, keep Data sets up-to-date and include as much information about vendors and business units as possible.
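The filtering rules for the InvoiceReader/ShouldFilterUnsureCompanyHypotheses flag can be summarized in a small Python sketch. It is a simplification: the product rates vendor-BU pairs, whereas this sketch selects per company type, and the dictionary keys "reliable" and "score" are hypothetical.

```python
def select_final(hypotheses, kind, filter_unsure=True):
    """Pick the final hypothesis for kind "vendor" or "business_unit".

    Each hypothesis is a dict with the hypothetical keys "reliable" (bool)
    and "score" (number); filter_unsure mirrors the registry flag.
    """
    reliable = [h for h in hypotheses if h["reliable"]]
    if not filter_unsure:
        candidates = hypotheses  # flag is false: reliability is not considered
    elif reliable:
        candidates = reliable  # unreliable hypotheses are dropped
    elif kind == "business_unit":
        candidates = hypotheses  # flag is ignored when no reliable BU hypothesis exists
    else:
        candidates = []  # vendor: nothing is detected
    return max(candidates, key=lambda h: h["score"], default=None)
```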
Results of detecting the vendor and business unit
The main results of detecting the vendor and business unit on the invoice are:
- the identifier of the vendor record in the Vendors data set
- the identifier of the business unit record in the BusinessUnits data set
Note: If the Vendors data set specifies that Id depends on BusinessUnitId (see Vendors data set), the result of vendor detection will contain the Id that corresponds to the BusinessUnitId.
A vendor or business unit may be detected unreliably. In this case, the document's registration parameter fc_Predefined:InvoiceIsVendorSuspicious (or fc_Predefined:InvoiceIsBusinessUnitSuspicious for a business unit) will be set to true.
The regions of the following fields may be found as a result of vendor and business unit detection:
For the vendor:
For the business unit:
By examining the locations of these regions, you can see exactly where on the image the program found the fields of the Vendor and Business Unit field groups that enabled it to detect the vendor and the business unit.
Note: If the field values for IBAN and VATID are absent from the Vendors data set, keywords and format can be used to detect the appropriate values the same way that bank details are detected (if the corresponding vendor has been found).
Note: The search for any field region can be modified through training or by applying an additional FlexiLayout (see Capturing additional invoice fields). This will have no effect on vendor and business unit detection, but may affect the location of the regions of the fields in these field groups after the Document Definition is matched with the invoices.
An important result of detecting the vendor and business unit is that information about their respective countries is retrieved from the CountryCode field of the records found in the data set. This information is then used to select keywords and tax rates and to capture other invoice fields. It is also used as a condition for launching validation rules for the invoice.
How to change the way the program detects the vendor or business unit
The better a vendor or business unit record in the data set matches the text extracted from an invoice image, the more accurately the program detects the vendor or business unit.
First, you need to identify the data in the external database that corresponds to the data set columns used for finding the company on an invoice. The external database and the data set have to be properly connected (see Using vendor and business unit databases).
If one and the same company occurs both in the list of vendors and in the list of business units, you must specify the same VATID for the respective records in both data sets (even if there is no VATID on invoices). This will prevent the program from detecting the vendor and business unit incorrectly.
To compensate for possible variations in field values on images, use:
Using pre-determined vendor and business unit values in conjunction with extracted values
The vendor or the business unit of an invoice can be determined in advance based on the invoice's source (the name of the Scanning Operator or the e-mail address of the message's sender).
You can specify the vendor and/or the business unit explicitly prior to automatic detection.
To do so, set the value of the document's registration parameter fc_Predefined:InvoicePredefinedVendorId (fc_Predefined:InvoicePredefinedBusinessUnitId) to the identifier (Id) of an entry in the Vendors or BusinessUnits data set.
Doing this does not prevent automatic detection of the vendor and/or the business unit from taking place. Thanks to this, in addition to the pre-determined vendor and/or business unit, you will get a confidence value (this value indicates how well the pre-determined values match values extracted from the image), as well as the regions of fields from the Vendor and/or Business Unit field groups.
Invoice Header field group
An invoice's header includes, among others, the InvoiceNumber and InvoiceDate fields.
These fields are detected using keywords that are specified in the language properties of the Document Definition. The vendor and the business unit are detected first, providing information about the countries of the vendor and business unit. The countries determine languages (languages that correspond to a country are specified in the Document Definition). The set of keywords for finding fields is taken from the countries of the vendor and the business unit.
How does the program determine that a document is an invoice?
ABBYY FlexiCapture determines whether a document is an invoice when applying the FlexiLayout.
The conditions listed below indicate that a document is an invoice. Not all of these conditions have to be met, but each one carries a certain weight.
- InvoiceNumber and InvoiceDate fields were detected.
- Keywords from the InvoiceIdentifiers located element were detected (see Keywords).
- A vendor or a business unit was detected on the document.
A document can be identified as a credit note if keywords from the CreditNoreKeyword element were detected on the image or if the document has a negative Total.
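A weighted decision of this kind can be sketched in Python as follows. The weights and the threshold are purely hypothetical, since the actual values are not stated here. The credit note rule follows the description above.

```python
from typing import Optional

# Hypothetical weights and threshold; the actual values are not documented here.
WEIGHTS = {"number_and_date": 4, "identifier_keywords": 3, "company_detected": 3}
THRESHOLD = 5


def is_invoice(number_and_date: bool, identifier_keywords: bool, company_detected: bool) -> bool:
    """Sum the weights of the conditions that hold and compare with the threshold."""
    signals = {
        "number_and_date": number_and_date,
        "identifier_keywords": identifier_keywords,
        "company_detected": company_detected,
    }
    score = sum(WEIGHTS[name] for name, present in signals.items() if present)
    return score >= THRESHOLD


def is_credit_note(credit_keyword_found: bool, total: Optional[float]) -> bool:
    """A credit note keyword or a negative Total marks the document as a credit note."""
    return credit_keyword_found or (total is not None and total < 0)
```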
Amounts field group
ABBYY FlexiCapture for Invoices captures the following fields from an invoice:
|Field|Invoice Processing (Au-NZ), Invoice Processing (US), Invoice Processing (CA), Invoice Processing (EU), Invoice Processing (JP)|Invoice Processing (ES)|
|The total sum of the invoice (Total) and the currency of the invoice (Currency)|Yes|Yes|
|Additional tax (AdditionalCosts)|Yes|Yes|
Information from the Document Definition is used to find sums and tax rates:
- Rates of taxes payable in the vendor's country (you can specify these on the Tax Rates tab of the country's properties; see Country and language settings).
- Keywords for tax rates (you can specify these on the Keywords tab of the language's properties; see also Keywords).
The program will try to find up to two tax rates on the image. If there are more than two tax rates in the invoice, additional fields can be created and filled in manually on the data form.
The program uses keywords to detect the TotalTax and TotalNetto fields. You can specify these keywords in the properties of a country or language, depending on how the keyword should be used (for details, see Country and language settings). For more on keywords, see Keywords.
There are two types of keywords for the Total field, located in different categories (for more on Located elements categories see Keywords):
- AmountTotalHighConfidenceLabels: keywords that only occur near the Total field, such as "Pay this amount."
- AmountTotalLowConfidenceLabels: keywords that can occur near the Total field but can also occur near other fields. For example, the keyword "Total" can appear near the Total field but may also occur near a field that contains the total weight of all items on an invoice.
Tip. If you are not sure which of these two categories to add a keyword to, add it to AmountTotalHighConfidenceLabels. If you encounter invoices where the keyword causes the program to identify another field as the Total field, you can move it to AmountTotalLowConfidenceLabels.
In addition to keywords, the program will look for the following items when attempting to detect the Total field:
- Numbers that occur two or three times in the same line or in the same column on the image. Such numbers may be the Total on invoices where no taxes are specified.
- Numbers that are sums of the numbers located above them in the same column.
- The largest (by absolute value) numbers located at the end of the document.
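Two of the heuristics listed above can be illustrated with a short Python sketch. It assumes that the numbers of one column have already been extracted in top-to-bottom order; the function names are hypothetical, and the sketch only considers runs of numbers directly above a candidate.

```python
from decimal import Decimal


def column_sum_candidates(column):
    """Return the values that equal the sum of two or more numbers directly above them."""
    found = []
    for i, value in enumerate(column):
        running = Decimal("0")
        # Walk upwards, accumulating the numbers directly above position i.
        for j in range(i - 1, -1, -1):
            running += column[j]
            if i - j >= 2 and running == value:
                found.append(value)
                break
    return found


def largest_by_magnitude(numbers):
    """Return the number with the largest absolute value."""
    return max(numbers, key=abs)
```

Decimal is used instead of float so that sums of monetary values compare exactly.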
The program will search for the Currency field only if a Total field has been detected. Keywords from the properties of the country in the Document Definition will be used.
Any fields in the Amounts field group that could not be detected on the image will be calculated automatically, except for the Total field. This field must be detected on the image.
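How missing amounts might be derived from a detected Total can be sketched as follows. The formulas are an assumption, not taken from the product: the sketch supposes a single tax rate and that Total = TotalNetto + TotalTax. The function name and its parameters are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def fill_amounts(total, tax_rate_percent, total_netto=None, total_tax=None):
    """Derive the missing TotalNetto and TotalTax values from a detected Total."""
    if total is None:
        # The Total itself is never calculated; it must come from the image.
        raise ValueError("Total must be detected on the image")
    if total_netto is None and total_tax is not None:
        total_netto = total - total_tax
    elif total_netto is None:
        divisor = 1 + tax_rate_percent / Decimal(100)
        total_netto = (total / divisor).quantize(CENT, rounding=ROUND_HALF_UP)
    if total_tax is None:
        total_tax = total - total_netto
    return {"Total": total, "TotalNetto": total_netto, "TotalTax": total_tax}
```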
If the program fails to correctly extract information from the fields in the Amounts field group, the Total field is marked as requiring verification.
If the program fails to detect the Total and Currency fields with a high degree of confidence or fails to detect them altogether, you can use training to improve the quality of extraction.
Purchase Order field group
ABBYY FlexiCapture for Invoices can extract all purchase order numbers and their corresponding sums from the invoice.
This feature is disabled by default (see Purchase order matching).
To extract Purchase Order numbers, you will need a data set with a list of possible Purchase Order numbers and their sums (see PurchaseOrders data set).
The Purchase Order field can be extracted using:
- a regular expression;
- a data set containing possible purchase order numbers (see PurchaseOrders data set).
If a data set with possible purchase order numbers is used, ABBYY FlexiCapture for Invoices will search images for numbers from this data set. It is best to keep the number of purchase order numbers in the database as small as possible, and there are several ways to reduce it:
- Use the VendorId column of the data set. In this case the program will only use Purchase Order numbers from the invoice's vendor.
- Filter out purchase orders for which an invoice has already been received and only add the numbers of purchase orders for which no invoice has been received yet to the data set.
The program will search the database for sums that correspond to detected Purchase Order numbers.
The program will also search the image for all Purchase Order numbers, including those that are in the invoice's line items.
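The data set based search can be sketched in Python as follows. Only the VendorId column name comes from the description above. The OrderNumber, Total, and Invoiced keys, as well as the function name, are hypothetical, and plain substring search stands in for the product's image search.

```python
def match_purchase_orders(text, purchase_orders, vendor_id):
    """Return {order number: sum} for open orders of the vendor that occur in the text."""
    matches = {}
    for po in purchase_orders:
        # Shrink the search set: skip other vendors and already invoiced orders.
        if po["VendorId"] != vendor_id or po["Invoiced"]:
            continue
        if po["OrderNumber"] in text:
            matches[po["OrderNumber"]] = po["Total"]
    return matches
```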
Purchase orders are usually generated by the buyer's ERP system, so invoices billed to a specific Business Unit tend to be similar. It is usually possible to describe them using a regular expression.
If there is a regular expression for purchase order numbers, the program will detect all numbers that satisfy the expression on images. The regular expression can be specified in an XML configuration file using the following tags:
For more on XML configuration files, see Editing invoice processing settings in XML files.
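The regular expression approach can be illustrated with Python's re module. The pattern below describes a made-up purchase order format ("PO", an optional hyphen, and seven digits) and is not taken from the product. The regular expression syntax accepted in the XML configuration file may also differ from Python's.

```python
import re

# Hypothetical format: "PO", an optional hyphen, then exactly seven digits.
PO_PATTERN = re.compile(r"\bPO-?\d{7}\b")


def find_po_numbers(text):
    """Return all distinct matches of the pattern, in order of first appearance."""
    return list(dict.fromkeys(PO_PATTERN.findall(text)))
```

The word boundaries keep the pattern from matching part of a longer number, which matters when invoices also contain other reference numbers.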
The Line Items field group
ABBYY FlexiCapture for Invoices can extract invoice line items from images.
Extraction of invoice line items is disabled by default (see Additional fields).
For a list of fields that the program extracts automatically, see Captured fields.
ABBYY FlexiCapture for Invoices first searches the image for a table. During this search, it uses the keywords for column titles which are specified for every language in the Document Definition's properties. Keywords for columns of invoice line items are also used for classifying items, i.e. for determining the type of each invoice line item column.
After this, the program uses information about detected columns and mathematical expressions to find invoice line items in the invoice's table.
Finally, the program searches invoice line items for fields from columns.
Training can be used to improve the quality of automatic line item extraction.
One of the main advantages offered by neural networks is their ability to self-learn: neural networks can detect complex dependencies existing among input data and make some useful generalizations.
The program includes two neural networks that can be used to capture the following fields:
- Vendor \ Name
- Vendor \ Address
- Business Unit \ Name
- Business Unit \ Address
- Purchase Orders \ Order Number
- Unit of measurement
- Unit Price
- Total Price Netto
For maximum precision, the program will use both a FlexiLayout and its neural networks to capture invoice fields. Those fields that the program fails to extract using its neural networks will be extracted using the FlexiLayout. If a field can be extracted both by the neural networks and the FlexiLayout, the program will intelligently combine the results obtained through both methods. How the results are combined depends on the field (see Combining the field detection results for details).
Disabling the neural networks
By default, the neural networks will be used as the second method of capturing document fields. If you need to process documents other than invoices within your invoice project, you may want to disable the neural network, as it was specifically trained to capture invoice fields and may not perform well on other types of documents.
To disable the neural network for the Line Items group:
- Open the Document Definition Editor.
- Click Document Definition Properties... → Document Definition Settings → Additional Fields and Features.
- Disable the Thorough extraction of invoice line items option.
To disable the neural network for the Invoice Header, Vendor, Business Unit, and Purchase Order groups:
- Open the Document Definition Editor.
- Click Document Definition Properties... → Document Definition Settings → Additional Fields and Features.
- Disable the Thorough extraction of invoice header fields option.
How the program combines the field detection results or selects the best result depends on the field. As a general rule, precedence will be given to the results obtained by the respective neural network. Exceptions to this rule are searches based on data sets and searches using regular expressions created for specific customer documents.
Invoice Header field group
The results obtained by the neural network will always have precedence for the following fields:
- Invoice Number
- Invoice Date
Business unit and vendor
By default, the business unit and vendor are detected based on a data set, provided a data set is selected.
Additionally, the following fields may be detected using the neural network if there is no corresponding record in the data set:
- VATID (ABN)
If no data set is selected, only the neural network will be used.
Purchase Order field group
The neural network will only be used if the value is not detected by means of a data set or a regular expression.
For line item fields, precedence will be given to the results obtained by the neural network. If the neural network detects the entire table of line items, this table will be used for further processing. Otherwise, the program will use the line items detected by means of the FlexiLayout.
If the neural network detects only the Description and TotalPriceNetto fields for each line item, they will be complemented with the fields detected by means of the FlexiLayout.
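The line item rules above can be sketched in Python as follows. This is one possible reading of those rules, with several assumptions: line items are represented as dictionaries keyed by field name, rows from the two methods are paired by position, and the neural network result is kept unchanged when the row counts differ. The function name is hypothetical.

```python
PARTIAL_FIELDS = {"Description", "TotalPriceNetto"}


def combine_line_items(nn_items, layout_items):
    """Combine line items found by the neural network and by the FlexiLayout."""
    if not nn_items:
        return layout_items  # nothing from the neural network: use the FlexiLayout
    partial = all(set(item) <= PARTIAL_FIELDS for item in nn_items)
    if not partial:
        return nn_items  # the neural network result takes precedence
    if len(nn_items) != len(layout_items):
        return nn_items  # assumption: rows cannot be paired, so nothing is merged
    # Complement each row with FlexiLayout fields; neural network values win.
    return [{**layout, **nn} for nn, layout in zip(nn_items, layout_items)]
```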