Working with ABBYY FineReader Engine Regular Expressions

Regular expressions are used in regular-expression-based dictionaries to define which words are allowed in a language and which are not.

Regular expression rules

The ABBYY FineReader Engine regular expression alphabet is described in the following table:

Item name	Conventional regular expression sign	Usage examples and explanations
Any character	.	c.t denotes words like "cat", "cot"
Character from a character range	[]	[b-d]ell denotes words like "bell", "cell", "dell"; [ty]ell denotes the words "tell" and "yell"
Character out of a character range	[^]	[^y]ell denotes words like "dell", "cell", "tell", but forbids "yell"; [^n-s]ell denotes words like "bell", "cell", but forbids "nell", "oell", "pell", "qell", "rell", and "sell"
Or	|	c(a|u)t denotes the words "cat" and "cut"
0 or more occurrences in a row	*	10* denotes the numbers 1, 10, 100, 1000, etc.
1 or more occurrences in a row	+	10+ allows the numbers 10, 100, 1000, etc., but forbids 1
Letter or digit	[0-9a-zA-Z]	[0-9a-zA-Z] allows a single character; [0-9a-zA-Z]+ allows any word
Capital Latin letter	[A-Z]
Small Latin letter	[a-z]
Capital Cyrillic letter	[А-Я]
Small Cyrillic letter	[а-я]
Digit	[0-9]
Space	[\s]
System character	@
Word from dictionary	@(Dictionary)	The Dictionary parameter sets the path to the user dictionary from which words must be taken. Backslashes in the path must be doubled. For example: @(D:\\MyFolder\\MyDictionary.amd). Note: Some programming languages (such as C++) require you to escape backslashes in string literals. In this case, you will need two escaped backslashes, which will result in a quadrupled backslash. The example above will look like this in C++: L"@(D:\\\\MyFolder\\\\MyDictionary.amd)"
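
The backslash-doubling rule for dictionary paths can be sketched in Python. The helper name dictionary_reference is hypothetical and is not part of the FineReader Engine API; it only builds the text of the expression element.

```python
def dictionary_reference(path: str) -> str:
    """Build an @(Dictionary) element for a user dictionary path.

    Hypothetical helper: it doubles every backslash, as the rule in the
    table requires, and wraps the result in @( ... ).
    """
    return "@(" + path.replace("\\", "\\\\") + ")"


# A raw string keeps the Windows path exactly as typed.
expression = dictionary_reference(r"D:\MyFolder\MyDictionary.amd")
print(expression)  # @(D:\\MyFolder\\MyDictionary.amd)
```

Building the element at run time from a raw path avoids the quadrupled backslashes that appear when the expression is written by hand as a string literal.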

Notes:

Some characters used in regular expressions are "auxiliary," i.e., they are used for system purposes. As you can see from the list above, such characters include square brackets, periods, etc. If you wish to enter an auxiliary character as a normal one, put a backslash (\) before it. Example: [t-v]x+ denotes words like "tx", "txx", "txxx", etc., "ux", "uxx", etc., but \[t-v\]x+ denotes words like "[t-v]x", "[t-v]xx", "[t-v]xxx" etc.
If you need to group certain regular expression elements, use parentheses. For example, (a|b)+|c denotes "c" and any combination like "abbbaaabbb", "ababab", etc. (a word of any non-zero length consisting of a's and b's in any order), whereas a|b+|c denotes "a", "c", and "b", "bb", "bbb", etc.
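
The two notes above can be checked with a short Python sketch. This is an illustration only: Python's re module is not the FineReader Engine dialect, and the sketch assumes that escaping, grouping, alternation, and + mean the same thing in both. The helper name allowed is hypothetical.

```python
import re

# Assumption: Python's re module stands in for the Engine dialect here;
# only constructs described in the table above are used.
ESCAPED = re.compile(r"\[t-v\]x+")    # brackets taken literally
UNESCAPED = re.compile(r"[t-v]x+")    # brackets define a character range
GROUPED = re.compile(r"(a|b)+|c")
UNGROUPED = re.compile(r"a|b+|c")


def allowed(pattern: re.Pattern, word: str) -> bool:
    """Return True when the whole word matches the pattern."""
    return pattern.fullmatch(word) is not None


# Print, for each word, whether the grouped and ungrouped forms allow it.
for word in ["c", "a", "bbb", "ab", "ababab"]:
    print(word, allowed(GROUPED, word), allowed(UNGROUPED, word))
```

Whole-word matching (fullmatch) is used because a dictionary expression describes complete words, not substrings.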

Sample regular expressions

Regular expression for dates

The number denoting the day may consist of one digit (e.g., 1, 2) or two digits (e.g., 02, 12), but it cannot be zero (0 or 00). The group (|0) stands for an optional leading zero: either nothing or the digit 0. The regular expression for the day will then look like this: ((|0)[1-9])|([12][0-9])|(30)|(31).

The regular expression for the month will look like this: ((|0)[1-9])|(10)|(11)|(12).

The regular expression for the year will look like this: (((19)|(20))[0-9][0-9])|([0-9][0-9]).

What is left is to combine all this together and separate the numbers by a period (e.g., 1.03.1999). The period is an auxiliary sign, so we must put a backslash (\) before it. The regular expression for the full date will then look like this:

(((|0)[1-9])|([12][0-9])|(30)|(31))\.(((|0)[1-9])|(10)|(11)|(12))\.((((19)|(20))[0-9][0-9])|([0-9][0-9]))
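
As a quick sanity check, the date expression can be assembled from its three parts and tried against sample strings. This is a sketch in Python, whose re module is not the FineReader Engine dialect; it assumes the constructs used here behave the same way in both. The names DAY, MONTH, YEAR, DATE, and is_date are illustrative only.

```python
import re

# The three parts exactly as given in the text above.
DAY = r"((|0)[1-9])|([12][0-9])|(30)|(31)"
MONTH = r"((|0)[1-9])|(10)|(11)|(12)"
YEAR = r"(((19)|(20))[0-9][0-9])|([0-9][0-9])"

# Each part is wrapped in parentheses and joined with an escaped period.
DATE = re.compile(rf"({DAY})\.({MONTH})\.({YEAR})")


def is_date(text: str) -> bool:
    """Return True when the whole string matches the date expression."""
    return DATE.fullmatch(text) is not None


for sample in ["1.03.1999", "01.1.99", "0.03.1999", "32.01.1999", "1.13.1999"]:
    print(sample, is_date(sample))
```

Note that the expression checks each number separately, so a string such as 31.02.1999 still matches even though that date does not exist.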

Regular expression for e-mail addresses

You can easily make a language for denoting e-mail addresses. The regular expression for an e-mail address may look like this:

[a-zA-Z0-9_\-\.]+\@[a-zA-Z0-9\.\-]+\.[a-zA-Z]+
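
The e-mail expression can be tried out in the same way. The sketch below uses Python's re module, which is not the FineReader Engine dialect; it assumes the character classes and escapes used here carry over unchanged (in Python, \@ is simply a literal @). The names EMAIL and is_email are illustrative.

```python
import re

# The expression exactly as given above.
EMAIL = re.compile(r"[a-zA-Z0-9_\-\.]+\@[a-zA-Z0-9\.\-]+\.[a-zA-Z]+")


def is_email(text: str) -> bool:
    """Return True when the whole string matches the e-mail expression."""
    return EMAIL.fullmatch(text) is not None


for sample in ["john.doe@example.com", "john@example", "john doe@example.com"]:
    print(sample, is_email(sample))
```

The expression is deliberately loose: it requires a local part, an @ sign, and a domain ending in a period followed by letters, but it does not validate addresses beyond that shape.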

Using for data capture

If you use regular expressions in field-level recognition, you generally need to recognize only the words which are exact matches for the regular expression. In this case, we recommend creating a separate language for recognizing the fields and setting the following properties for it:

Only dictionary words must be allowed as recognition results: set the IBaseLanguage::AllowWordsFromDictionaryOnly property to TRUE. This is necessary for exact matching.
The letter set for the recognition language must contain only those characters that are included in the regular expression: specify the IBaseLanguage::LetterSet property. This is necessary because the characters from the language alphabet can be recognized even if they do not fit the regular expression.
Set the IBaseLanguage::IsNaturalLanguage property to FALSE.
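
To illustrate the letter set recommendation, the sketch below collects the characters that the two sample expressions above can produce. The helper name letter_set and the constants are hypothetical; the code only prepares candidate strings, and how such a value is assigned to IBaseLanguage::LetterSet depends on the language binding and is not shown here.

```python
import string


def letter_set(*groups: str) -> str:
    """Join character groups into one string, dropping duplicates and keeping order."""
    return "".join(dict.fromkeys("".join(groups)))


# Characters that can appear in a word matching the date expression.
DATE_LETTERS = letter_set(string.digits, ".")

# Characters that can appear in a word matching the e-mail expression.
EMAIL_LETTERS = letter_set(string.ascii_letters, string.digits, "_-.@")

print(DATE_LETTERS)  # 0123456789.
print(len(EMAIL_LETTERS))  # 66
```

Keeping the set this small means that a character outside the expression, such as a letter in a date field, cannot appear in the recognition result.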
