Normalization of values in data sets

This article describes the types of normalization that can be used when adding columns from an external database to a data set in an ABBYY FlexiCapture for Invoices Document Definition, and the settings of these normalization types.

Normalization can be used to change the format of values that are written differently but essentially mean the same thing. Normalization enforces consistent formatting of values so that they can be compared. For example, the address and name of a company may be written in a variety of different ways. Since these values refer to the same company and the same address, they need to be normalized to enable the program to make a proper comparison.

The type of normalization can be specified for each column in a data set when mapping these columns to columns in an external database.

Normalization is only applied to values stored in the data set (the Cache data option must be enabled in the data set's properties). Values in the external database will not be changed.

How does normalization work during data extraction in the program?

1. Text

This type of normalization is useful when comparing strings such as company names and addresses.

  1. White space (this includes newline and tab characters) and separation symbols are replaced with regular spaces.
  2. Periods used as separators (periods that are placed between words) are replaced with spaces, and periods in abbreviations are removed.
  3. Normalization of conjunction symbols (&, +, -, /, ~):
    • Sets of words that begin with a single-letter word and are separated by the same conjunction symbol are joined into a single word, e.g. R & D becomes R&D;
    • In all other cases conjunction symbols are replaced with spaces, e.g. Procter&Gamble becomes Procter Gamble.
  4. Double spaces are removed.
  5. A list specified in advance is used to split words. For example, CoKG is split into Co KG.
  6. Spaces in recognized text are used to split it into separate words.
  7. A list specified in advance is used to replace suffixes in each word. For example, you can replace the suffix strasse with the suffix str.
  8. Strings of words are automatically replaced according to a list specified in advance. For example, you can replace the word Limited with the abbreviation Ltd.

Normalization parameters are specified in the Normalization.xml file, which is stored in the project's folder.
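
The text normalization steps above can be illustrated with a minimal Python sketch. This is not the program's actual algorithm and does not read Normalization.xml. The lists SPLIT_WORDS, SUFFIXES and REPLACEMENTS, the set of separation symbols, and the rule used to tell separator periods from abbreviation periods are all assumptions made for illustration.

```python
import re

# Hypothetical stand-ins for the lists configured in Normalization.xml.
SPLIT_WORDS = {"CoKG": "Co KG"}    # step 5: words to split
SUFFIXES = {"strasse": "str"}      # step 7: suffix replacements
REPLACEMENTS = {"Limited": "Ltd"}  # step 8: word replacements (single words only here)

# Either a set of words that starts with a single-letter word and is joined
# by one repeated conjunction symbol, or a lone conjunction symbol.
CONJUNCTION = re.compile(r"(?<!\w)\w\s*([&+\-/~])\s*\w+(?:\s*\1\s*\w+)*|[&+\-/~]")


def _fix_conjunction(match: re.Match) -> str:
    if match.group(1):
        # "R & D" -> "R&D": drop the spaces around the symbol.
        return re.sub(r"\s+", "", match.group(0))
    # Any other conjunction symbol becomes a space.
    return " "


def normalize_text(value: str) -> str:
    # Step 1: white space and separation symbols (assumed here: , ; :) -> space.
    text = re.sub(r"[\s,;:]+", " ", value)
    # Step 2 (assumed rule): a period between two longer words is a separator;
    # every other period is treated as part of an abbreviation and removed.
    text = re.sub(r"(?<=\w\w)\.(?=\w\w)", " ", text)
    text = text.replace(".", "")
    # Step 3: conjunction symbols.
    text = CONJUNCTION.sub(_fix_conjunction, text)
    # Step 4: collapse repeated spaces.
    text = re.sub(r" {2,}", " ", text).strip()
    # Step 5: split words found in the split list.
    text = " ".join(SPLIT_WORDS.get(word, word) for word in text.split())
    # Step 6: split the text into words on spaces.
    words = text.split()
    # Step 7: replace known suffixes in each word.
    for i, word in enumerate(words):
        for old, new in SUFFIXES.items():
            if word.lower().endswith(old):
                words[i] = word[: -len(old)] + new
                break
    # Step 8: replace whole words from the replacement list.
    words = [REPLACEMENTS.get(word, word) for word in words]
    return " ".join(words)


print(normalize_text("Procter&Gamble  Limited"))
print(normalize_text("Muster CoKG,\tHauptstrasse 5"))
```

With these assumed lists, two spellings of the same company name or address reduce to the same string and can then be compared directly.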


Note: Significant changes may be made to the normalization algorithm in future versions of the program.

2. Alphanumeric code

This normalization type is useful when comparing alphanumeric codes such as tax ID numbers, bank account numbers and postal codes.

All symbols except for numerals and letters are removed from values, allowing you to compare values while ignoring spaces, dashes, slashes and other arbitrary characters that these values may contain.
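
As a rough illustration, the rule above amounts to keeping only letters and digits. The sketch below is an approximation in Python, not the program's implementation. The function name normalize_code is hypothetical, and letter case is left unchanged because the description does not mention case folding.

```python
def normalize_code(value: str) -> str:
    # Keep only letters and digits; spaces, dashes, slashes and other
    # characters are dropped. Letter case is preserved (assumption).
    return "".join(ch for ch in value if ch.isalnum())


# Two spellings of the same (made-up) tax ID compare as equal once normalized.
print(normalize_code("DE 123-456/789") == normalize_code("DE123456789"))
```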

When normalization is applied, the Store normalized value option becomes available when mapping the data set column to a column in an external database.

  • When this option is enabled, normalized values will be stored in the data set.
  • When this option is disabled, original values from the external database will be copied to the data set.

This option does not affect data extraction or automated checks, but it does determine which value will be displayed to a user when the user searches for an entry in a dictionary.
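
The effect of the option can be pictured with a small Python sketch. It is a conceptual model only: the function build_entry, the dictionary keys and the simplified normalize_code helper are hypothetical and do not reflect how the program stores data internally.

```python
def normalize_code(value: str) -> str:
    # Simplified alphanumeric code normalization: keep letters and digits only.
    return "".join(ch for ch in value if ch.isalnum())


def build_entry(original: str, store_normalized: bool) -> dict:
    # The normalized form is always what gets compared (assumed model),
    # so matching behaves the same whichever way the option is set.
    normalized = normalize_code(original)
    return {
        "match_key": normalized,
        # Only the value shown to the user depends on the option.
        "displayed": normalized if store_normalized else original,
    }


on = build_entry("DE 123-456", store_normalized=True)
off = build_entry("DE 123-456", store_normalized=False)
print(on["displayed"], "|", off["displayed"])
```

In this model both entries match the same input, and only the displayed value differs.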

01.12.2020 7:03:59
