Regular Expressions

The table below lists the regular expressions that can be used to create a dictionary for a custom language.

Item name                  Symbol                Usage examples and explanations
Any Character              .                     c.t: denotes "cat," "cot," etc.
Character from Group       []                    [b-d]ell: denotes "bell," "cell," and "dell"
                                                 [ty]ell: denotes "tell" and "yell"
Character not from Group   [^]                   [^y]ell: denotes "dell," "cell," "tell," but forbids "yell"
                                                 [^n-s]ell: denotes "bell," "cell," but forbids "nell," "oell," "pell," "qell," "rell," and "sell"
Or                         |                     c(a|u)t: denotes "cat" and "cut"
0 or More Matches          *                     10*: denotes numbers 1, 10, 100, 1000, etc.
1 or More Matches          +                     10+: allows numbers 10, 100, 1000, etc., but forbids 1
Letter or Digit            [0-9a-zA-Zа-яА-Я]     [0-9a-zA-Zа-яА-Я]: allows any single letter or digit
                                                 [0-9a-zA-Zа-яА-Я]+: allows any word
Capital Latin Letter       [A-Z]
Small Latin Letter         [a-z]
Capital Cyrillic Letter    [А-Я]
Small Cyrillic Letter      [а-я]
Digit                      [0-9]
                           @                     Reserved.
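
The symbols in the table behave the same way in most regular expression engines, so they can be tried out before a dictionary is built. The following is a minimal sketch that uses Python's standard re module as a stand-in; it is an assumption that the product's own engine agrees with Python on these basic symbols, and the helper name matches is hypothetical.

```python
import re


def matches(pattern: str, word: str) -> bool:
    """Return True when the entire word matches the pattern."""
    return re.fullmatch(pattern, word) is not None


# (pattern, word, expected result) triples taken from the table rows.
CHECKS = [
    (r"c.t", "cat", True),            # any character
    (r"[b-d]ell", "cell", True),      # character from group
    (r"[b-d]ell", "tell", False),
    (r"[^y]ell", "tell", True),       # character not from group
    (r"[^y]ell", "yell", False),
    (r"c(a|u)t", "cut", True),        # or
    (r"c(a|u)t", "cot", False),
    (r"10*", "1", True),              # 0 or more matches
    (r"10+", "1", False),             # 1 or more matches
    (r"10+", "1000", True),
    (r"[0-9a-zA-Zа-яА-Я]+", "Слово42", True),  # letters and digits
]

for pattern, word, expected in CHECKS:
    assert matches(pattern, word) == expected, (pattern, word)
```

Whole-word matching (re.fullmatch) is used here because a dictionary entry has to describe the complete word, not a fragment of it.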

Note:

  1. To use a regular expression symbol as a normal character, precede it with a backslash. For example, [t-v]x+ stands for tx, txx, etc., ux, uxx, etc., and vx, vxx, etc., but \[t-v\]x+ stands for [t-v]x, [t-v]xx, [t-v]xxx, etc.
  2. To group regular expression elements, use parentheses. For example, (a|b)+|c stands for c or any combination such as abbbaaabbb, ababab, etc. (a word of any non-zero length in which there may be any number of a's and b's in any order), while a|b+|c stands for a, c, b, bb, bbb, etc.
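
Both notes can be checked with a short script. This is a sketch only, again using Python's re module on the assumption that escaping and grouping work there as described in the notes; the helper name matches is hypothetical.

```python
import re


def matches(pattern: str, word: str) -> bool:
    """Return True when the entire word matches the pattern."""
    return re.fullmatch(pattern, word) is not None


# Note 1: without backslashes, [t-v] is a character group.
assert matches(r"[t-v]x+", "uxx")
assert not matches(r"[t-v]x+", "[t-v]x")

# With backslashes, the square brackets are ordinary characters.
assert matches(r"\[t-v\]x+", "[t-v]xx")
assert not matches(r"\[t-v\]x+", "uxx")

# Note 2: parentheses decide what the + and | symbols apply to.
assert matches(r"(a|b)+|c", "ababab")   # + repeats the whole group
assert matches(r"(a|b)+|c", "c")
assert not matches(r"a|b+|c", "ababab")  # + repeats only b
assert matches(r"a|b+|c", "bbb")
```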

Examples

Suppose you are recognizing a table with three columns: birth dates, names, and e-mail addresses. In this case, you can create two new languages, Data and Address, and specify the following regular expressions for them.

Regular expression for dates:

The number denoting a day may consist of one digit (1, 2, etc.) or two digits (02, 12), but it cannot be zero (00 or 0). The regular expression for the day should then look like this: ((|0)[1-9])|([12][0-9])|(30)|(31).

The regular expression for the month should look like this: ((|0)[1-9])|(10)|(11)|(12).

The regular expression for the year, which allows four-digit years from 1900 to 2099 as well as two-digit years, should look like this: ((19)[0-9][0-9])|((20)[0-9][0-9])|([0-9][0-9]).

Now all we need to do is combine these expressions and separate the numbers with periods (e.g. 1.03.1999). The period is a regular expression symbol, so you must put a backslash (\) before it. Each part must also be enclosed in its own pair of parentheses; otherwise the | symbols would split the whole expression instead of only the day, month, or year. The regular expression for the full date should then look like this:

(((|0)[1-9])|([12][0-9])|(30)|(31))\.(((|0)[1-9])|(10)|(11)|(12))\.(((19)[0-9][0-9])|((20)[0-9][0-9])|([0-9][0-9]))
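
A date expression of this kind is easiest to verify by assembling it from its parts and running it against a few sample strings. The sketch below does this in Python, which is an assumption about how closely the product's engine follows common regular expression behavior. It wraps each component in its own parentheses so that the alternation does not spill across the period separators, and it writes the character group for the first digit of the day as [12]. The names DAY, MONTH, YEAR, and is_date are hypothetical.

```python
import re

# Components: one or two digits, never zero; four-digit or two-digit year.
DAY = r"((|0)[1-9])|([12][0-9])|(30)|(31)"
MONTH = r"((|0)[1-9])|(10)|(11)|(12)"
YEAR = r"((19)[0-9][0-9])|((20)[0-9][0-9])|([0-9][0-9])"

# Each component gets its own parentheses; the periods are escaped.
DATE = re.compile(rf"({DAY})\.({MONTH})\.({YEAR})")


def is_date(text: str) -> bool:
    """Return True when text is a complete day.month.year string."""
    return DATE.fullmatch(text) is not None


assert is_date("1.03.1999")
assert is_date("31.12.2020")
assert is_date("01.3.99")
assert not is_date("0.03.1999")    # day cannot be zero
assert not is_date("32.01.1999")   # no such day
assert not is_date("1.13.1999")    # no such month
```

The expression checks only the shape of the date: a string such as 31.02.1999 is still accepted, because the day and the month are validated independently.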

Regular expression for e-mail addresses:

[a-zA-Z0-9_\-\.]+\@[a-z0-9\.\-]+
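
The e-mail expression can be tried out in the same way. The sketch below copies it unchanged into Python's re module, where a backslash before @ simply produces a literal @; whether the product's engine treats every escape identically is an assumption, and the names EMAIL and is_email are hypothetical.

```python
import re

# The expression from the text, unchanged.
EMAIL = re.compile(r"[a-zA-Z0-9_\-\.]+\@[a-z0-9\.\-]+")


def is_email(text: str) -> bool:
    """Return True when the whole text fits the e-mail expression."""
    return EMAIL.fullmatch(text) is not None


assert is_email("john.smith@example.com")
assert is_email("J_Smith-2@mail-server.example.org")
assert not is_email("john smith@example.com")  # space is not allowed
assert not is_email("john.smith@")             # domain part is required
assert not is_email("john@Example.com")        # domain allows lowercase only
```

The expression is deliberately loose: it restricts the characters on each side of the @ sign but does not require the domain to contain a period.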

14.01.2020 17:26:19
