Regular Expressions
This section describes the regular expression syntax supported by the ABBYY Mobile Capture SDK engine for capturing custom data fields (see How to Capture a Custom Data Field).
Note: All matches are always greedy (match as much as possible). The search stops at the first match: if a string contains two or more substrings matching your regular expression, only the first one (closest to the beginning) is matched.
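The two rules in this note can be illustrated with Python's built-in `re` module, whose `re.search` also returns the leftmost match and whose quantifiers are greedy by default. This is only a sketch of the semantics: Python's `re` is a different engine from the one in the SDK, and the helper name `first_match` is hypothetical.

```python
import re

def first_match(pattern, text):
    """Return the leftmost match of pattern in text, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Two substrings match \d+, but only the one closest to the beginning is returned.
print(first_match(r"\d+", "invoice 123 of 456"))

# Greedy matching: c.+r extends to the last "r" instead of stopping at "cater".
print(first_match(r"c.+r", "caterpillar"))
```

If a field can occur more than once in the recognized text, the expression itself has to be specific enough that the first match is the one you want.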
Supported syntax
Pattern | Syntax | Examples and comments |
---|---|---|
Literal | any character or text, except the metacharacters \^$.|?*+()[{ | pill matches "pill" in "caterpillar". Metacharacters are part of regular expression syntax; to match them literally, escape them with a backslash. For example, to match 1+1, the correct expression is 1\+1; otherwise "+" has a special meaning. |
Any character | . (dot) | s.t matches "sat", "sit" but not "seat" |
Character set | [] | gr[ae]y matches both "gray" and "grey" but not "greay" |
Character range in a set | - (minus) | [0-9] matches a single digit. Concatenation is allowed: [a-zA-Z0-9] matches an alphanumeric character |
Negated character set | [^] | [^0-9] matches anything that is not a digit |
Shorthand classes | \s: any whitespace; \S: anything that is not whitespace; \d: any digit; \D: anything that is not a digit; \w: a word character, which includes alphanumerics and punctuation marks; \W: a non-word character; \R: a new line character or the CR LF sequence; \v: a new line character but not the CR LF sequence; \V: a non-new line character; \h: a horizontal whitespace character; \H: anything except horizontal whitespace | |
Non-printable characters | \n: line feed (LF); \r: carriage return (CR); \t: tab character; \f: form feed; \a: bell character (\u0007); \e: escape character | |
Unicode character | \uFFFF or \x{FFFF} | \u20AC or \x{20AC} matches the euro currency sign. |
Character by its hexadecimal index | \xFF | \xA9 matches the copyright character in the Latin-1 character set |
Alternation | | (vertical bar) | abc|123 matches either "abc" or "123"; |word matches either an empty string "" or "word" |
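Repetitions | + * ? {n} {n,m} {n,} {,m} | + matches one or more times; * matches zero or more times; ? matches zero or one time; {n} matches exactly n times; {n,m} matches from n to m times; {n,} matches n or more times; {,m} matches at most m times. Note that all repetitions are greedy (they prefer to match as much as possible): c.+r matches all of "caterpillar" and does not stop at "cater". To match up to the first occurrence of a certain character, use a negated character set: c[^r]+r matches "cater" in "caterpillar". |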
Grouping | () | (word)+ matches "word", "wordword" and so on |
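The examples in the table can be tried out with a short script. The sketch below uses Python's `re` module as a stand-in, assuming it behaves like the SDK engine for this common subset of the syntax; it is not the SDK itself. Entries such as \R, \h and \x{FFFF} are left out because Python's `re` does not accept them. The helper `first_match` and the sample strings are hypothetical.

```python
import re

def first_match(pattern, text):
    """Return the leftmost match of pattern in text, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# (pattern, sample text) pairs adapted from the table above.
EXAMPLES = [
    (r"pill", "caterpillar"),         # literal text
    (r"1\+1", "1+1=2"),               # escaped metacharacter
    (r"s.t", "seat sit"),             # any single character
    (r"gr[ae]y", "greay grey"),       # character set
    (r"[a-zA-Z0-9]", "--x--"),        # concatenated ranges
    (r"[^0-9]+", "2022-03"),          # negated character set
    (r"abc|123", "xx123abc"),         # alternation
    (r"(word)+", "a wordword b"),     # grouping with repetition
    (r"\d{2,4}", "id 123456"),        # bounded repetition, greedy
    (r"c[^r]+r", "caterpillar"),      # stop at the first "r"
]

for pattern, text in EXAMPLES:
    print(f"{pattern!r} in {text!r} -> {first_match(pattern, text)!r}")
```

Running expressions against sample text in this way is a quick check before passing them to the SDK, but any difference between engines has to be verified on the device.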
Unsupported syntax
The following regular expression syntax features are not yet supported in ABBYY Mobile Capture SDK:
- Anchors: ^ (beginning of a line), $ (end of a line), \b (word boundary) and its negation \B, and others.
- Lazy quantifiers such as +? or {n,m}? that prefer to match as few times as possible.
- Concatenation with nested character sets such as [[a-z][0-9]].
- Advanced features such as lookarounds, backreferences, possessive matches, named groups, non-capturing and atomic match groups, evaluation flag settings, and others.
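Expressions written for other engines often contain constructs from this list. The following is a minimal sketch of a pre-check that flags some of them before an expression is handed to the SDK. The function `find_unsupported` and its messages are hypothetical and not part of the SDK; it is a rough character scan rather than a full parser, so it can miss or over-report cases (for example, a literal "]" placed first in a character set).

```python
DIGITS = set("123456789")
QUANTIFIER_ENDS = set("+*?}")

def find_unsupported(pattern):
    """Return (position, description) pairs for constructs on the unsupported list."""
    issues = []
    in_set = False  # True while scanning inside [...]
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        nxt = pattern[i + 1] if i + 1 < n else ""
        if ch == "\\":
            # An escape consumes the following character as well.
            if not in_set and nxt in ("b", "B"):
                issues.append((i, "word boundary anchor"))
            elif not in_set and nxt in DIGITS:
                issues.append((i, "backreference"))
            i += 2
            continue
        if in_set:
            if ch == "[":
                issues.append((i, "nested character set"))
            elif ch == "]":
                in_set = False
        elif ch == "[":
            in_set = True
        elif ch in ("^", "$"):
            issues.append((i, "anchor"))
        elif ch == "(" and nxt == "?":
            # Covers lookarounds, named, non-capturing and atomic groups, inline flags.
            issues.append((i, "special group or lookaround"))
            i += 2
            continue
        elif ch in QUANTIFIER_ENDS and nxt == "?":
            issues.append((i, "lazy quantifier"))
            i += 2
            continue
        elif ch in QUANTIFIER_ENDS and nxt == "+":
            issues.append((i, "possessive quantifier"))
            i += 2
            continue
        i += 1
    return issues

for p in [r"c[^r]+r", r"^\d+$", r"c.+?r", r"[[a-z][0-9]]"]:
    print(p, find_unsupported(p))
```

An empty list means the scan found nothing from the list above; it does not guarantee that the SDK accepts the expression.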
3/2/2022 12:59:15 PM