Normalization of values in data sets
This article describes the types of normalization that can be applied when adding columns from an external database to a data set in an ABBYY FlexiCapture for Invoices Document Definition, and the settings available for each normalization type.
Normalization can be used to change the format of values that are written differently but essentially mean the same thing. Normalization enforces consistent formatting of values so that they can be compared. For example, the address and name of a company may be written in a variety of different ways. Since these values refer to the same company and the same address, they need to be normalized to enable the program to make a proper comparison.
The type of normalization can be specified for each column in a data set when mapping these columns to columns in an external database.
Normalization is only applied to values stored in the data set (the Cache data option must be enabled in the data set's properties). Values in the external database will not be changed.
How does normalization work during data extraction in the program?
ABBYY FlexiCapture for Invoices offers two types of normalization for values from the data set.
The first type of normalization is useful when comparing strings such as company names and addresses. It applies the following rules:
- Whitespace characters (including newline and tab characters) and separator symbols are replaced with regular spaces.
- Periods used as separators (periods placed between words) are replaced with spaces, and periods in abbreviations are removed.
- Conjunction symbols (&, +, -, /, ~) are normalized as follows:
  - Sets of words that begin with a single-letter word and are separated by the same conjunction symbol are joined into a single word, e.g. R & D becomes R&D.
  - In all other cases, conjunction symbols are replaced with spaces, e.g. Procter&Gamble becomes Procter Gamble.
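The whitespace and conjunction-symbol rules above can be approximated with a short Python sketch. This is an illustration only, not the program's actual algorithm: the function name `normalize_string` is hypothetical, collapsing repeated spaces and trimming the result are assumptions, and period handling and other separator symbols are left out because the article does not say how abbreviations or separators are recognized.

```python
import re

# Conjunction symbols listed in the article: & + - / ~
CONJUNCTION_CLASS = r"[&+\-/~]"

# First alternative: a single-letter word followed by one or more words,
# all separated by the same conjunction symbol (e.g. "R & D").
# Second alternative: any other conjunction symbol.
_CONJUNCTION_RE = re.compile(
    r"(?P<joined>(?<!\w)[^\W\d_]\s*(?P<sym>" + CONJUNCTION_CLASS + r")\s*\w+"
    r"(?:\s*(?P=sym)\s*\w+)*)"
    r"|" + CONJUNCTION_CLASS
)


def _replace_conjunction(match: re.Match) -> str:
    joined = match.group("joined")
    if joined:
        # Join the words into one by dropping the spaces around the symbol.
        return re.sub(r"\s+", "", joined)
    # Any other conjunction symbol becomes a space.
    return " "


def normalize_string(value: str) -> str:
    """Hypothetical sketch of the whitespace and conjunction rules."""
    # Newlines, tabs and other whitespace become regular spaces.
    value = re.sub(r"\s+", " ", value)
    value = _CONJUNCTION_RE.sub(_replace_conjunction, value)
    # Assumption: repeated spaces are collapsed and the result is trimmed.
    return re.sub(r" +", " ", value).strip()


print(normalize_string("R & D"))
print(normalize_string("Procter&Gamble"))
```

With this sketch, two spellings of the same company name that differ only in spacing or conjunction symbols normalize to the same string and can be compared directly.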
Normalization parameters are specified in the Normalization.xml file, which is stored in the project's folder.
The Normalization.xml file can be modified after the data set has been created (separately for each data set). To modify the standard normalization settings, do the following:
- Download the settings file using the DownloadNormalizationSettings FCAdminTools command.
- Make the appropriate changes.
- Upload the settings file using the UpdateNormalizationSettings FCAdminTools command.
Important! After updating the settings file, you need to update the data set. For more information, see Updating data sets.
Note: Significant changes may be made to the normalization algorithm in future versions of the program.
2. Alphanumeric code
This normalization type is useful when comparing alphanumeric codes such as tax ID numbers, bank account numbers and postal codes.
All symbols except for numerals and letters are removed from values, allowing you to compare values while ignoring spaces, dashes, slashes and other arbitrary characters that these values may contain.
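The effect of this rule can be sketched in a few lines of Python. The function name `normalize_code` is hypothetical, and the sketch assumes that letter case is left unchanged, since the article does not mention case handling.

```python
def normalize_code(value: str) -> str:
    """Hypothetical sketch: keep only letters and digits."""
    # str.isalnum() is true for letters and digits, so spaces, dashes,
    # slashes and other punctuation are dropped.
    return "".join(ch for ch in value if ch.isalnum())


# Two spellings of the same code compare equal after normalization.
print(normalize_code("DE 123-456/789") == normalize_code("DE123456789"))
```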
When normalization is applied, the Store normalized value option becomes available when mapping the data set column to a column in an external database.
- When this option is enabled, normalized values will be stored in the data set.
- When this option is disabled, original values from the external database will be copied to the data set.
This option does not affect data extraction or automated checks, but it does determine which value will be displayed to a user when the user searches for an entry in a dictionary.
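The difference between the two settings can be modeled with a small Python sketch. All names here (`normalize_code`, `stored_value`, `values_match`) are hypothetical and only illustrate the behavior described above: the option changes what is stored and shown to the user, while comparisons use normalized values either way.

```python
def normalize_code(value: str) -> str:
    # Hypothetical alphanumeric-code normalizer: keep letters and digits only.
    return "".join(ch for ch in value if ch.isalnum())


def stored_value(original: str, store_normalized: bool) -> str:
    """Value kept in the data set and shown to the user on lookup."""
    return normalize_code(original) if store_normalized else original


def values_match(extracted: str, original: str) -> bool:
    """Comparison is done on normalized values regardless of the option."""
    return normalize_code(extracted) == normalize_code(original)


db_value = "DE 123-456"
print(stored_value(db_value, store_normalized=True))
print(stored_value(db_value, store_normalized=False))
print(values_match("DE123456", db_value))
```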