Regular Expressions

This section describes the regular expression syntax supported by the ABBYY Mobile Capture SDK engine for capturing custom data fields (see How to Capture a Custom Data Field).

Note: All matches are greedy, that is, they match as much as possible. The search stops at the first match: if a string contains two or more substrings matching your regular expression, only the first one (closest to the beginning) is matched.
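
The first-match rule in the note can be illustrated with a short sketch. It uses Python's standard re module as a stand-in, not the SDK engine itself; the helper name first_match is hypothetical, and the assumption is only that re.search follows the same leftmost, greedy behavior for these simple patterns.

```python
import re

def first_match(pattern, text):
    """Return (matched text, start index) of the leftmost match, or None."""
    m = re.search(pattern, text)
    return (m.group(0), m.start()) if m else None

# Only the first "a" (index 1) is reported; the one at index 9 is never reached.
print(first_match(r"a", "caterpillar"))          # ('a', 1)

# The match is greedy (all of "123") and the search stops before "456".
print(first_match(r"[0-9]+", "id 123 and 456"))  # ('123', 3)
```

If every occurrence is needed, the application has to repeat the search on the remaining text itself; the note above describes a single search only.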

Supported syntax

Pattern: Literal
Syntax: any character or text, except the metacharacters \^$.|?*+()[{
Examples and comments:
  • pill matches "pill" in "caterpillar"
  • a matches the first "a" in "caterpillar" but not the second (the search stops at the first match)
  • Metacharacters are part of regular expression syntax; to match them literally, escape them with a backslash. To match 1+1, the correct expression is 1\+1; otherwise "+" keeps its special meaning.

Pattern: Any character
Syntax: . (dot)
Examples and comments:
  • s.t matches "sat" and "sit" but not "seat"

Pattern: Character set
Syntax: []
Examples and comments:
  • gr[ae]y matches both "gray" and "grey" but not "greay"

Pattern: Character range in a set
Syntax: - (minus)
Examples and comments:
  • [0-9] matches a single digit
  • Concatenation is allowed: [a-zA-Z0-9] matches an alphanumeric character

Pattern: Negated character set
Syntax: [^]
Examples and comments:
  • [^0-9] matches anything that is not a digit

Pattern: Shorthand classes
Syntax:
  • \s matches any whitespace
  • \S matches anything that is not whitespace
  • \d matches any digit
  • \D matches anything that is not a digit
  • \w matches a word character, which includes alphanumerics and punctuation marks
  • \W matches a non-word character
  • \R matches a new line character or the CR LF sequence
  • \v matches a new line character but not the CR LF sequence
  • \V matches a non-new line character
  • \h matches a horizontal white space character
  • \H matches anything except horizontal white space

Pattern: Non-printable characters
Syntax:
  • \n matches line feed (LF)
  • \r matches carriage return (CR)
  • \t matches the tab character
  • \f matches form feed
  • \a matches the bell character \u0007
  • \e matches the escape character

Pattern: Unicode character
Syntax: \uFFFF or \x{FFFF}
Examples and comments:
  • \u20AC or \x{20AC} matches the euro currency sign

Pattern: Character by its hexadecimal index
Syntax: \xFF
Examples and comments:
  • \xA9 matches the copyright character in the Latin-1 character set

Pattern: Alternation
Syntax: |
Examples and comments:
  • abc|123 matches either "abc" or "123"
  • |word matches either an empty string "" or "word"

Pattern: Repetitions
Syntax: +, *, ?, {n}, {n,m}, {n,}, {,m}
Examples and comments:
  • + matches one or more times
  • * matches zero or more times
  • ? matches zero times or once (optional match)
  • {n} matches exactly n times
  • {n,m} matches n to m times
  • {n,} matches n or more times
  • {,m} matches zero to m times
  • Note that all repetitions are greedy (they prefer to match as much as possible): c.+r will match "caterpillar", not stopping at "cater". To match up to the first occurrence of a certain character, use its negation: c[^r]+r will match "cater" in "caterpillar".

Pattern: Grouping
Syntax: ()
Examples and comments:
  • (word)+ matches "word", "wordword" and so on
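
Several of the examples under Supported syntax can be checked mechanically. The sketch below again uses Python's re module as a stand-in for the SDK engine, which is an assumption: it covers only constructs on which the two agree (for instance, it leaves out \R, \h, \x{FFFF} and \w, where Python differs or lacks support). The names EXAMPLES and run_examples are hypothetical.

```python
import re

# (pattern, text, expected first match or None), taken from the examples above.
EXAMPLES = [
    (r"pill", "caterpillar", "pill"),              # literal text
    (r"1\+1", "1+1=2", "1+1"),                     # escaped metacharacter
    (r"s.t", "sit", "sit"),                        # dot matches exactly one character
    (r"s.t", "seat", None),
    (r"gr[ae]y", "grey", "grey"),                  # character set
    (r"gr[ae]y", "greay", None),
    (r"[a-zA-Z0-9]", "--x--", "x"),                # concatenated ranges
    (r"[^0-9]", "42x", "x"),                       # negated set
    (r"\u20AC", "5 \u20ac", "\u20ac"),             # euro sign by Unicode escape
    (r"\xA9", "(c) \u00a9", "\u00a9"),             # copyright sign by hex index
    (r"abc|123", "xx123", "123"),                  # alternation
    (r"a{,2}b", "aaab", "aab"),                    # at most two repetitions
    (r"c.+r", "caterpillar", "caterpillar"),       # greedy repetition
    (r"c[^r]+r", "caterpillar", "cater"),          # negated set stops at first "r"
    (r"(word)+", "wordword!", "wordword"),         # repeated group
]

def run_examples(examples):
    """Return the examples whose first match differs from the expected value."""
    failures = []
    for pattern, text, expected in examples:
        m = re.search(pattern, text)
        got = m.group(0) if m else None
        if got != expected:
            failures.append((pattern, text, expected, got))
    return failures

# An empty list means every example behaved as described.
print(run_examples(EXAMPLES))
```

A table of this kind is a convenient way to try out a pattern before using it for a custom data field, provided the pattern stays within the syntax both engines share.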

Unsupported syntax

The following regular expression syntax features are not yet supported in ABBYY Mobile Capture SDK:

  • Anchors: ^ (beginning of a line), $ (end of a line), \b (word boundary) and its negation \B, and others.
  • Lazy quantifiers such as +? or {n,m}? that prefer to match as few times as possible.
  • Concatenation with nested character sets such as [[a-z][0-9]].
  • Advanced features such as lookarounds, backreferences, possessive matches, named groups, non-capturing and atomic match groups, evaluation flag settings, and others.
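
Because a pattern written for another engine may contain some of these constructs, it can help to screen it before use. The following is a rough sketch of such a check in Python. It is not part of the SDK: find_unsupported is a hypothetical helper, and it is a heuristic scanner rather than a full regular expression parser, so it covers only anchors, word boundaries, numbered backreferences, lazy quantifiers, nested character sets and (?...) groups.

```python
def find_unsupported(pattern):
    """Flag constructs from the unsupported list (heuristic, not a full parser)."""
    found = []
    in_set = False        # currently inside [...]
    after_quant = False   # previous token was +, *, ? or a closing }
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        nxt = pattern[i + 1:i + 2]   # next character, or "" at the end
        quant = False
        if ch == "\\":
            if not in_set and nxt in ("b", "B"):
                found.append("word boundary")
            elif not in_set and nxt in tuple("123456789"):
                found.append("backreference")
            i += 1                   # skip the escaped character
        elif in_set:
            if ch == "[":
                found.append("nested character set")
            elif ch == "]":
                in_set = False
        elif ch == "[":
            in_set = True
            if nxt == "^":
                i += 1               # negation marker, not an anchor
        elif ch in "^$":
            found.append("anchor")
        elif ch == "(" and nxt == "?":
            found.append("special group")   # lookaround, named, non-capturing, flags
            i += 1
        elif ch == "?" and after_quant:
            found.append("lazy quantifier")
        elif ch in "+*?}":
            quant = True
        after_quant = quant
        i += 1
    return found

print(find_unsupported(r"^\d+$"))     # ['anchor', 'anchor']
print(find_unsupported(r"c.+?r"))     # ['lazy quantifier']
print(find_unsupported(r"c[^r]+r"))   # []
```

An empty result does not guarantee that the SDK accepts the pattern; it only means none of the constructs this sketch looks for were found.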

02.03.2022 12:59:15
