Detecting the main fields
This article describes how the main fields of an invoice are detected and captured.
The program starts processing an invoice by pre-recognizing its text in accordance with the Document Definition settings:
- Pre-recognition mode (Very fast / Fast / Balanced / Thorough) determines the speed of pre-recognition and the quality of the text layer obtained as a result. To specify a pre-recognition mode, in the Document Definition Editor, click Document Definition → Document Definition Properties... → Recognition.
- Pre-recognition languages are the languages to be used for pre-recognition. To specify pre-recognition languages, in the Document Definition Editor, click Document Definition → Document Definition Properties... → Document Definition Settings, and then click Edit in the Countries and Languages group to select the required languages.
Once an invoice is pre-recognized, the program starts capturing its fields.
To detect and capture fields on an invoice, the program can use:
- a FlexiLayout
- neural networks
Both methods are described below, together with the algorithm that either combines the results obtained by both methods or selects the best result.
Using a FlexiLayout
Business unit and vendor
Once the program detects the business unit and the vendor on an invoice, it obtains:
- the identifier of the corresponding business unit record in the BusinessUnits data set
- the identifier of the corresponding vendor record in the Vendors data set
The program uses the following to detect the vendor (or business unit):
- the following fields of the record in the data set: IBAN, VATID, National VATID, Name, Street, City, ZIP
- the settings of the Document Definition: IBAN, VATID, and National VATID formats and their corresponding keywords
For more about the columns in the BusinessUnits and Vendors data sets and how they are used, see Columns in the BusinessUnits data set and Columns in the Vendors data set.
Company detection algorithm
- Looking for the company IDs
The fields that uniquely identify a company are:
- VATID
- National VATID
- IBAN
ABBYY FlexiCapture for Invoices will first look for the company that fits the invoice in the data set, relying on the above fields. For this search to be successful, the corresponding columns of the company record in the data set must be completed, and their values must be present on the image.
The formats of VATID, National VATID, and IBAN are specified by means of regular expressions in the properties of the Document Definition (click Document Definition → Document Definition Properties... → Document Definition Settings, click Edit in the Countries and Languages group, select the required country in the Countries list, click Edit, and then click the Formats tab). The keywords are specified in the same dialog box on the Keywords tab.
For these fields to be detected, the program must find their exact matches on the image. You can use extended regular expressions to give the program some leeway and make allowances for possible recognition errors (see Extended regular expressions).
The detected values of the VATID, National VATID, and IBAN fields are then normalized:
- all letters are changed to upper case,
- spaces and the following characters are removed: ., ,, —, /, \.
If, on the Formats tab, a regular expression is used to specify a letter prefix of a field value, the recognized prefix will be replaced by the primary prefix specified on the same tab.
For example, the identifier "DE12345" may be recognized as "OE12345," but normalization will replace the "OE" with the correct "DE."
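The normalization steps described above can be sketched in Python. This is only an illustration, not the program's implementation: the function names and the way the prefix pattern is passed in are assumptions, and the pattern uses Python regular expression syntax rather than the syntax of the Formats tab. Only the upper-casing and the list of removed characters come from the description above.

```python
import re

# Space plus the characters listed in the article: . , dash (U+2014) / \
REMOVED_CHARS = " .,\u2014/\\"
_STRIP_TABLE = str.maketrans("", "", REMOVED_CHARS)


def normalize_identifier(value: str) -> str:
    """Upper-case the value and drop spaces and separator characters."""
    return value.upper().translate(_STRIP_TABLE)


def apply_primary_prefix(value: str, prefix_pattern: str, primary_prefix: str) -> str:
    """Replace a recognized letter prefix with the primary prefix.

    prefix_pattern is a hypothetical Python regex describing the prefix
    variants that recognition may produce (for example "[DO]E").
    """
    return re.sub(r"^(?:" + prefix_pattern + r")",
                  lambda _match: primary_prefix, value, count=1)


normalized = normalize_identifier("oe 123.45")
print(apply_primary_prefix(normalized, "[DO]E", "DE"))
```

The same normalization would be applied to both the values found on the image and the values read from the data set, so that the two sides are compared in the same form.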
The program will use the VATID, National VATID, and IBAN fields detected on the image to query the data set. The values of the VATID, National VATID, and IBAN columns received from the data set are normalized in the same manner as the values of the fields detected on the image. Next, the values from the data set are matched against the values on the image (fuzzy matching is used) and the number of discrepancies is counted.
A company record from the data set is considered to be a reliable match for the company on the image, if:
- at least two identifiers in the record are identical to those detected on the image, with no more than one discrepancy, or
- all three identifiers of the record (VATID, National VATID, and IBAN) are identical to those detected on the image, with no more than two discrepancies
If a reliable match is found, the search stops.
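The reliability rule above can be expressed as a small decision function. The sketch below is one possible reading of the rule, not the program's actual logic: it assumes that each identifier compared by fuzzy matching yields a discrepancy count, that `None` marks an identifier that could not be compared (missing on the image or in the record), and that the limits apply to the sum of discrepancies. The function name and input shape are hypothetical.

```python
from typing import Dict, Optional


def is_reliable_id_match(discrepancies: Dict[str, Optional[int]]) -> bool:
    """Decide whether a data set record reliably matches the image.

    discrepancies maps "VATID", "NationalVATID" and "IBAN" to the number
    of discrepancies found by fuzzy matching, or None if not compared.
    """
    compared = sorted(d for d in discrepancies.values() if d is not None)
    # Rule 1: at least two identifiers match with at most one discrepancy.
    if len(compared) >= 2 and compared[0] + compared[1] <= 1:
        return True
    # Rule 2: all three identifiers match with at most two discrepancies.
    if len(compared) == 3 and sum(compared) <= 2:
        return True
    return False


print(is_reliable_id_match({"VATID": 0, "NationalVATID": 1, "IBAN": None}))
```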
Otherwise, the program looks for the name and the address of the company, using as hypotheses the records it has retrieved from the data set.
- Looking for the company name and address
The program uses the records it retrieved from the data set as hypotheses when looking for the name and address of the company.
If the search for the unique IDs returns a non-empty set of records from the data set, this set of records is used.
If the program fails to find the company name and address relying on this set of records, or if this set of records is empty:
- the VATID, National VATID, and IBAN fields have not been detected on the image, or
- the VATID, National VATID, and IBAN columns of the matching record in the data set have no values,
the program will query the data set using the entire text of the page. As a result, the company records that best match the text on the page are returned and are then used as hypotheses.
The program will search the image for the Name field and the Street, ZIP, and City components of the address from each record serving as a hypothesis:
- The program looks for the Name and Street fields either as one string, with up to 25% of errors allowed, or as individual words (in the latter case, the clustering of words on the image is taken into account).
- The program looks for the ZIP and City fields as a single component, expecting both fields to be positioned on the same line.
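To illustrate what "one string with 25% of errors allowed" might mean in practice, the sketch below compares a data set value with a string read from the image using edit distance. This is an assumption made for illustration: the article does not say how errors are counted, and the function names and the choice of Levenshtein distance relative to the length of the data set value are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution
            ))
        previous = current
    return previous[-1]


def matches_with_errors(expected: str, found: str, max_error_ratio: float = 0.25) -> bool:
    """True if 'found' differs from 'expected' by at most the allowed share of errors."""
    if not expected:
        return not found
    return edit_distance(expected, found) / len(expected) <= max_error_ratio


print(matches_with_errors("ACME GmbH", "ACNE GmbH"))
```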
The detected values of the Name, Street, ZIP, and City fields are compared with the values in the data set record and the number of discrepancies is counted.
The decision about which data set record best matches the image is made as follows:
- The record is considered to be a reliable match if, besides the unique identifiers (VATID, National VATID, IBAN), the Name field and at least one address component (i.e. Street or the combination of ZIP and City) match the text on the image and there are no more than four discrepancies.
- The record is considered to be an unreliable match if, besides the unique identifiers (VATID, National VATID, IBAN), all the address components match the text on the image, but the Name field does not match the image text.
- If the data set record has no unique identifiers (VATID, National VATID, IBAN), or if they have not been detected on the image, the program will select as the reliable match the record with a matching Name and at least one matching address component (i.e. Street or the combination of ZIP and City), provided there are no more than eight discrepancies in total.
Otherwise, the record with the highest number of fields matching the values detected on the image is considered to be an unreliable match.
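The decision rules above can be summarized in code. The following sketch is illustrative only: the `RecordMatch` structure, the verdict strings, and the assumption that matching has already produced one boolean per field plus a total discrepancy count are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RecordMatch:
    ids_matched: bool        # unique identifiers matched on the image
    name_matched: bool
    street_matched: bool
    zip_city_matched: bool   # ZIP and City are treated as one component
    discrepancies: int

    def matched_fields(self) -> int:
        return sum([self.ids_matched, self.name_matched,
                    self.street_matched, self.zip_city_matched])


def classify(match: RecordMatch) -> Optional[str]:
    """Return "reliable", "unreliable", or None if no explicit rule applies."""
    any_address = match.street_matched or match.zip_city_matched
    all_address = match.street_matched and match.zip_city_matched
    limit = 4 if match.ids_matched else 8
    if match.name_matched and any_address and match.discrepancies <= limit:
        return "reliable"
    if match.ids_matched and all_address and not match.name_matched:
        return "unreliable"
    return None


def pick_best(matches: List[RecordMatch]) -> Tuple[Optional[RecordMatch], str]:
    """Prefer a reliable record; otherwise fall back to an unreliable one."""
    for verdict in ("reliable", "unreliable"):
        for match in matches:
            if classify(match) == verdict:
                return match, verdict
    if not matches:
        return None, "none"
    # Fallback: the record with the most matching fields, marked unreliable.
    return max(matches, key=lambda m: m.matched_fields()), "unreliable"
```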
Results of detecting the vendor and business unit
The main results of detecting the vendor and business unit on the invoice are:
- the identifier of the vendor record in the Vendors data set
- the identifier of the business unit record in the BusinessUnits data set
Note: If the Vendors data set specifies that Id depends on BusinessUnitId (see Columns in the Vendors data set for details), the result of vendor detection will contain the Id that corresponds to the BusinessUnitId.
A vendor or a business unit may be detected unreliably. In this case, the document's registration parameter fc_Predefined:InvoiceIsVendorSuspicious (or fc_Predefined:InvoiceIsBusinessUnitSuspicious, respectively) will be set to true.
The regions of the following fields may be found as a result of vendor and business unit detection:
For the vendor:
- Name
- VatID
- NationalVatID
- IBAN
- Street
- Zip
- City
For the business unit:
- Name
- VatID
- Street
- Zip
- City
By examining the locations of these regions on the image, you can see exactly where the program found the fields of the Vendor and Business Unit field groups that enabled it to detect the vendor and the business unit.
Note: If the field values for IBAN and VATID are absent from the Vendors data set, keywords and format can be used to detect the appropriate values the same way that bank details are detected (if the corresponding vendor has been found).
Note: The search for any field region can be modified through training or by applying an additional FlexiLayout (see Capturing additional invoice fields). This will have no effect on vendor and business unit detection, but may affect the location of the regions of the fields in these field groups after the Document Definition is matched with the invoices.
An important result of detecting the vendor and business unit is that information about their respective countries is retrieved from the CountryCode field of the records found in the data set. This information is then used to select keywords and tax rates and to capture other invoice fields. It is also used as a condition for launching validation rules for the invoice.
How to change the way the program detects the vendor or business unit
The better a vendor or business unit record in the data set matches the text extracted from an invoice image, the more accurately the program detects the vendor or business unit.
First, you need to identify the data in the external database that correspond to the data set columns used for finding the company on an invoice. The external database and the data set have to be properly connected (see Using vendor and business unit databases).
If one and the same company occurs both in the list of vendors and in the list of business units, you must specify the same VATID for the respective records in both data sets (even if there is no VATID on invoices). This will prevent the program from detecting the vendor and business unit incorrectly.
To compensate for possible variations in field values on images, use:
- normalization of data set columns (see Normalization of Values in data sets),
- multiple-value data set columns (see Multiple-value columns in a data set).
Using pre-determined vendor and business unit values in conjunction with extracted values
The vendor or the business unit on an invoice can be determined in advance based on the invoice's source (the name of the Scanning Operator or the e-mail address of the message's sender).
You can specify the vendor and/or the business unit explicitly prior to automatic detection.
To do so, set the value of the document's registration parameter fc_Predefined:InvoicePredefinedVendorId (fc_Predefined:InvoicePredefinedBusinessUnitId) to the identifier (Id) of an entry in the Vendors or BusinessUnits data set.
Doing this does not prevent automatic detection of the vendor and/or the business unit from taking place. Thanks to this, in addition to the pre-determined vendor and/or business unit, you will get a confidence value (this value indicates how well the pre-determined values match values extracted from the image), as well as the regions of fields from the Vendor and/or Business Unit field groups.
Invoice Header field group
InvoiceNumber, InvoiceDate
An invoice's header includes, among others, the InvoiceNumber and InvoiceDate fields.
These fields are detected using keywords that are specified in the language properties of the Document Definition. The vendor and the business unit are detected first, providing information about the countries of the vendor and business unit. The countries determine languages (languages that correspond to a country are specified in the Document Definition). The set of keywords for finding fields is taken from the countries of the vendor and the business unit.
You can change the way the program looks for regions of fields by editing keywords (see Keywords) and by using training (see Training).
How does the program determine that a document is an invoice?
ABBYY FlexiCapture determines whether a document is an invoice when applying the FlexiLayout.
The conditions listed below indicate that a document is an invoice. Not all of these conditions have to be met, but each one carries a certain weight.
- InvoiceNumber and InvoiceDate fields were detected.
- Keywords from the InvoiceIdentifiers located element were detected (See Keywords).
- A vendor or a business unit was detected on the document.
A document can be identified as a credit note if keywords from the CreditNoreKeyword element were detected on the image or if the document has a negative Total.
Amounts field group
ABBYY FlexiCapture for Invoices captures the following fields from an invoice:
Field | Invoice Processing (Au-NZ), Invoice Processing (US), Invoice Processing (CA), Invoice Processing (EU), Invoice Processing (JP) | Invoice Processing (ES)
---|---|---
The total sum of the invoice (Total) and the currency of the invoice (Currency) | Yes | Yes
Taxes: | Yes | Yes
 | No | Yes
Additional tax (AdditionalCosts) | Yes | Yes
Information from the Document Definition is used to find sums and tax rates:
- Rates of taxes payable in the vendor's country (you can specify these on the Tax Rates tab of the country's properties; see Country and language settings).
- Keywords for tax rates (you can specify these on the Keywords tab of the language's properties; see also Keywords).
The program will try to find up to two tax rates on the image. If there are more than two tax rates in the invoice, additional fields can be created and filled in manually on the data form.
The program uses keywords to detect the TotalTax and TotalNetto fields. You can specify these keywords in the properties of a country or language, depending on how the keyword should be used (for details, see Country and language settings). For more on keywords, see Keywords.
There are two types of keywords for the Total field, located in different categories (for more on Located elements categories see Keywords):
- AmountTotalHighConfidenceLabels: keywords that only occur near the Total field, such as "Pay this amount."
- AmountTotalLowConfidenceLabels: keywords that can occur near the Total field but can also occur near other fields. For example, the keyword "Total" can appear near the Total field but may also occur near a field that contains the total weight of all items on an invoice.
Tip. If you are not sure which of these two categories to add a keyword to, add it to AmountTotalHighConfidenceLabels. If you encounter invoices where the keyword causes the program to identify another field as the Total field, you can move it to AmountTotalLowConfidenceLabels.
In addition to keywords, the program will look for the following items when attempting to detect the Total field:
- Numbers that occur two or three times in the same line or in the same column on the image. Such numbers may be the Total on invoices where no taxes are specified.
- Numbers that are sums of the numbers located above them in the same column.
- The largest (by absolute value) numbers located at the end of the document.
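One of the cues listed above, a number that is the sum of the numbers above it in the same column, is easy to illustrate. The sketch below is a simplified assumption of how such a check could work: it treats a column as a top-to-bottom list of amounts and flags any amount equal to the sum of two or more amounts directly above it. The function name and the "contiguous run" interpretation are hypothetical and are not taken from the program.

```python
from decimal import Decimal
from typing import List


def find_column_sums(column: List[Decimal]) -> List[int]:
    """Return indices of amounts that equal the sum of a run of 2+ amounts directly above."""
    hits = []
    for i in range(2, len(column)):
        running = Decimal(0)
        # Walk upwards from the amount just above position i.
        for j in range(i - 1, -1, -1):
            running += column[j]
            if i - j >= 2 and running == column[i]:
                hits.append(i)
                break
    return hits


amounts = [Decimal("100.00"), Decimal("19.00"), Decimal("119.00")]
print(find_column_sums(amounts))
```

Decimal is used instead of float so that monetary amounts such as 0.10 + 0.20 compare exactly.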
The program will search for the Currency field only if a Total field has been detected. Keywords from the properties of the country in the Document Definition will be used.
Any fields in the Amounts field group that could not be detected on the image will be calculated automatically, except for the Total field. This field must be detected on the image.
If the program fails to correctly extract information from the fields in the Amounts field group, the Total field is marked as requiring verification.
If the program fails to detect the Total and Currency fields with a high degree of confidence or fails to detect them altogether, you can use training to improve the quality of extraction.
Purchase Order field group
ABBYY FlexiCapture for Invoices can extract all purchase order numbers and their corresponding sums from the invoice.
This feature is disabled by default (See Purchase order matching).
To extract Purchase Order numbers, you will need a data set with a list of possible Purchase Order numbers and their sums (See Columns in the PurchaseOrders data set).
The Purchase Order field can be extracted using:
- a regular expression
- a data set containing possible purchase order numbers (see Columns in the PurchaseOrders data set)
If a data set with possible purchase order numbers is used, ABBYY FlexiCapture for Invoices will search images for numbers from this data set. It is best to have as few purchase order numbers as possible in the database, and there are several things you can do to reduce their number:
- Use the VendorId column of the data set. In this case the program will only use Purchase Order numbers from the invoice's vendor.
- Filter out purchase orders for which an invoice has already been received and only add the numbers of purchase orders for which no invoice has been received yet to the data set.
The program will search the database for sums that correspond to detected Purchase Order numbers.
The program will also search the image for all Purchase Order numbers, including those that are in the invoice's line items.
Purchase orders are usually generated by the buyer's ERP system, so the purchase order numbers on invoices billed to a specific business unit tend to follow a similar format. It is usually possible to describe them using a regular expression.
If there is a regular expression for purchase order numbers, the program will detect all numbers that satisfy the expression on images. The regular expression can be specified in an XML configuration file using the following tags:
<InvoiceSettings>
  ...
  <OrderNumber>
    <Value>
      <RegularExpression></RegularExpression>
    </Value>
  </OrderNumber>
</InvoiceSettings>
For more on XML configuration files, see Editing invoice processing settings in XML files.
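As a rough illustration of detecting every number that satisfies an expression, the sketch below scans recognized text with a Python regular expression. The pattern `PO-\d{6}` and the sample text are invented for the example, and Python's regular expression syntax is not the syntax used in the XML configuration file (see Alphabet used in regular expressions), so the pattern should not be copied into `<RegularExpression>` as is.

```python
import re
from typing import List


def find_order_numbers(text: str, pattern: str) -> List[str]:
    """Return all distinct matches of the pattern, in order of first appearance."""
    found: List[str] = []
    for match in re.finditer(pattern, text):
        if match.group(0) not in found:
            found.append(match.group(0))
    return found


# Hypothetical invoice text: one number in the header, two in the line items.
sample = "Your order: PO-123456\n1  Bolts  PO-123456  10.00\n2  Nuts  PO-654321  5.00"
print(find_order_numbers(sample, r"PO-\d{6}"))
```

Because the whole text is scanned, numbers that appear only in line items are returned as well, which mirrors the behavior described above.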
The Line Items field group
ABBYY FlexiCapture for Invoices can extract invoice line items from images.
Extraction of invoice line items is disabled by default (See Additional fields).
For a list of fields that the program extracts automatically, see Captured fields.
ABBYY FlexiCapture for Invoices first searches the image for a table. During this search, it uses the keywords for column titles which are specified for every language in the Document Definition's properties. Keywords for columns of invoice line items are also used for classifying items, i.e. for determining the type of each invoice line item column.
After this, the program uses information about detected columns and mathematical expressions to find invoice line items in the invoice's table.
Finally, the program searches invoice line items for fields from columns.
Training can be used to improve the quality of automatic line item extraction.
Using neural networks
One of the main advantages offered by neural networks is their ability to self-learn: neural networks can detect complex dependencies existing among input data and make some useful generalizations.
The program includes two neural networks that can be used to capture the following fields:
- InvoiceNumber
- InvoiceDate
- Total
- Vendor \ Name
- Vendor \ Address
- Business Unit \ Name
- Business Unit \ Address
- Purchase Orders \ Order Number
- LineItems:
- OrderNumber
- OrderDate
- Position
- ArticleNumber
- Description
- Quantity
- Unit of measurement
- Unit Price
- Total Price Netto
- VATPercentage
For maximum precision, the program will use both a FlexiLayout and its neural networks to capture invoice fields. Those fields that the program fails to extract using its neural networks will be extracted using the FlexiLayout. If a field can be extracted both by the neural networks and the FlexiLayout, the program will intelligently combine the results obtained through both methods. How the results are combined depends on the field (see Combining the field detection results for details).
Disabling the neural networks
By default, the neural networks will be used as the second method of capturing document fields. If you need to process documents other than invoices within your invoice project, you may want to disable the neural network, as it was specifically trained to capture invoice fields and may not perform well on other types of documents.
To disable the neural network for the Line Items group:
- Open the Document Definition Editor.
- Click Document Definition Properties... → Document Definition Settings → Additional Fields and Features.
- Disable the Thorough extraction of invoice line items option.
To disable the neural network for the Invoice Header, Vendor, Business Unit, and Purchase Order groups:
- Open the Document Definition Editor.
- Click Document Definition Properties... → Document Definition Settings → Additional Fields and Features.
- Disable the Thorough extraction of invoice header fields option.
Combining the field detection results
How the program combines the field detection results or selects the best result depends on the field. As a general rule, precedence will be given to the results obtained by the respective neural network. Exceptions to this rule are searches based on data sets and searches using regular expressions created for specific customer documents.
Invoice Header field group
The results obtained by the neural network will always have precedence for the following fields:
- Invoice Number
- Invoice Date
- Total
Business unit and vendor
By default, the business unit and vendor are detected based on a data set, provided a data set is selected.
Additionally, the following fields may be detected using the neural network if there is no corresponding record in the data set:
- Name
- VATID (ABN)
- Address
If no data set is selected, only the neural network will be used.
Purchase Order field group
The neural network will only be used if the value is not detected by means of a data set or a regular expression.
Line items
For line item fields, precedence will be given to the results obtained by the neural network. If the neural network detects the entire table of line items, this table will be used for further processing. Otherwise, the program will use the line items detected by means of the FlexiLayout.
If the neural network detects only the Description and TotalPriceNetto fields for each line item, they will be complemented with the fields detected by means of the FlexiLayout.
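The precedence rules in this section can be condensed into a few helper functions. This is a sketch under stated assumptions, not the program's implementation: the function names are hypothetical, `None` stands for "not detected", and line items from the two methods are assumed to be aligned row by row.

```python
from typing import Dict, List, Optional


def combine_header_field(neural: Optional[str], layout: Optional[str]) -> Optional[str]:
    """Invoice Number, Invoice Date, Total: the neural network result takes precedence."""
    return neural if neural is not None else layout


def combine_purchase_order(data_set_or_regex: Optional[str],
                           neural: Optional[str]) -> Optional[str]:
    """Purchase order: the neural network is used only if nothing else found a value."""
    return data_set_or_regex if data_set_or_regex is not None else neural


def complement_line_items(neural_rows: List[Dict[str, str]],
                          layout_rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Fill in fields missing from neural network rows with FlexiLayout values."""
    combined = []
    for index, neural_row in enumerate(neural_rows):
        layout_row = layout_rows[index] if index < len(layout_rows) else {}
        # Neural network values win; FlexiLayout supplies the remaining fields.
        combined.append({**layout_row, **neural_row})
    return combined


rows = complement_line_items(
    [{"Description": "Bolt", "TotalPriceNetto": "10.00"}],
    [{"Description": "Bo1t", "Quantity": "5"}],
)
print(rows)
```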
02.03.2021 8:10:42