English (English)

Extract substring

This operation extracts data from the current substring, i.e. from cells in the selected columns.

Suppose your loaded data contains a column with comments and you want to extract certain information from these comments. To do this, you need to place this information into a separate column, map the fields of the new column, and apply the required analysis modules.

Configuring the operation

  1. Add a new operation and select Extract substring from the list.
    For detailed instructions on creating operations, see Repository & Data Management > ETL in the Cloud > Operations.
  2. In the operation editor, specify the following:
    1. Source column
      From the drop-down list, select the column from which you need to extract data.
    2. New column
      Enter a name for your new column. This column will contain the extracted data.
    3. Strategy
      Select a strategy to be used to extract data from the source column you specified in step 2а:
      1. Position
        This strategy extracts characters occupying specific positions in the substring.
        For example, if Start position is set to 1 and Extract string length is set to 3, the program will extract characters occupying positions 1 through 3.
      2. Pattern
        This strategy uses a regular expression to extract data. 
        You will need to specify a regular expression in the Pattern expression field (see Using regular expressions below).
        Click Show regexp hints to see help on regular expressions. 
    4. Use filter
      You can use filters if you need to apply an operation only to records that meet certain criteria (for example, only to those records where the "Employee" field is not empty).
      For details, see Using filters with operations. 
    5. Use the Run preview option if you want to view the result of the operation before performing it. Only part of the table will be displayed in preview mode.
  3. Click Save. This will close the operation editor window, with the created operation appearing on the top panel.

Using regular expressions

The Extract substring operation can use a regular expression to extract text from a column. Regular expressions allow you to specify complicated search patterns.

The tables below list the available operators and quantifiers and provide examples of their use.

Important.Text search with regular expressions is case-sensitive.

Note: https://regex101.com/ offers an online service that helps you create and parse regular expressions, explaining the purpose of each operator. There you can also test your regular expression and get detailed information on how it works. Simply enter your regular expression in the REGULAR EXPRESSION field and see it parsed and explained in the EXPLANATION pane.

Operators

Operator Description Example
String of characters Matches the specified string of characters.

Regular expression:
marketing

Column contains job titles:
"accountant"
"marketing specialist"
"marketing manager"

Regular expression will find:
" "
"marketing"
"marketing"

String of characters in round brackets

( )

Matches the specified string of characters. The round brackets allow additional operations to be applied to the string.

Regular expression:
(marketing)

Column contains job titles:
"accountant"
"marketing specialist"
"marketing manager"

Regular expression will find:
" "
"marketing"
"marketing"

String of characters in round brackets

(?:  )

Matches the specified string of characters. The matched substring will not be saved in the array of results.

Regular expression:
(?:marketing)

Any character

. (period)

Matches any character—a letter, a digit, or a special symbol.

Regular expression:
(Mark .)

Column contains names of employees:
“Mark Stanford”
“Dianne Millington”
“Mark Wood”

Regular expression will find:
“Mark S”
“ ”
“Mark W”

Characters in square brackets

[ ]

Matches any one of the specified characters. The search will stop as soon as a matching character is found.

To specify a range of possible characters, use “-”:
[0-9] stands for any digit
[a-z] stands for ant lower-case letter
[A-Z] stands for any upper-case letter
[a-zA-Z] stands for any letter

Regular expression:
[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]

Column contains text that mentions social security numbers:
“No. 934-80-1840”
“SSN 870-28-7383"
“SSN unknown”

Regular expression will find:
“934-80-1840"
“870-28-7383"
" "

Special symbols

\

Matches any one of the following special symbols:
\ ^ $ . | ? * + ( ) [ {

Use the backslash to escape a special symbol.

Regular expression:
\+

Column contains records marked with:

"+"
"-"

Regular expression will find:
"+"
" "

Number of occurrences

\{}

Specifies the number of times the preceding expression may occur.

Regular expression:
[0-9]\{4}

Column contains text that mentions extension phone numbers:
"Phone 1111"
"Extension 2222"
"No extension number"

Regular expression will find:
"1111"
"2222"
" "

Alternatives

(More than one)

|

Combines multiple alternatives and matches any of one of them.

Regular expression:
(M|m)arketing

Column contains job titles:
"accountant"
"marketing manager"
"Marketing Manager"

Regular expression will find:
" "
“marketing”
“Marketing”

Quantifiers

A quantifier specifies the allowed number of occurrences for the immediately preceding character, group of characters or class of characters.

Quantifier Description Example

?

Alternative

{0,1}

Repeats the preceding character zero or one times. Makes the preceding character optional. To make optional a group of characters, enclose the characters in round brackets.

Regular expression:
Mark (Stanford)?

Column contains names of employees:
“Mark Stanford”
“Dianne Millignton”
“Frank Wood”
“Mark Hamilton”

Regular expression will find:
“Mark Stanford”
“ ”
“ ”
“Mark”

*

Alternative

{0,}

Repeats the preceding character zero or more times.

Regular expression:
Mark Stanf.*

Column contains names of employees, some of them with typos:
“Mark Stanford”
“Dianne Millington”
“Mark Stanfard”
“Mark Stanfort”

Regular expression will find:
“Mark Stanford”
“ ”
“Mark Stanfard”
“Mark Stanfort”

+

Alternative

{1,}

Repeats the preceding character one or more times.

Regular expression:
[C|c]hapter [0-9]+

Column contains text that mentions book chapters:
“Chapter 3 describes...”
"chapter 15 provides information about..."

Regular expression will find:
"Chapter 3"
"chapter 15"

{m} Repeats the preceding pattern m times.

Regular expression:
\d{4}

Column contains text that mentions vehicle registration year:
“Vehicle registration year 2019”.

Regular expression will find:
“2019”

{m,} Repeats the preceding pattern at least m times.  

Regular expression:
\d{10,}

Column contains text that mentions customers' phone numbers:
"Phone 89991234567"
"Tel. 9991234567"

Regular expression will find:
"89991234567"
"9991234567"

{m, n} Repeats the preceding pattern m to n times (m cannot be greater than n).

Regular expression:
\d{2,4}

Column contains text that mentions vehicle registration year:
“Vehicle registration year 2019”
“Vehicle registration year '19”

Regular expression will find:
"2019"
"19"

Quantifiers may be:

  • greedy
  • non-greedy (also called lazy or reluctant)

Greedy is the default behavior and means that the quantifier will match as much as it can, i.e. the regular expression will initially look for the maximum number of matching characters and then will give back one character at a time if the remainder of the pattern cannot be matched.

let regexp = /".+"/g;
    let str = 'a "witch" and her "broom" is one';
    alert( str.match(regexp) ); // "witch" and her "broom" 
    

Conversely, a non-greedy quantifier will attempt to match the minimum number of occurrences.

let regexp = /".+?"/g; let
    str = 'a "witch" and her "broom" is one';
    alert( str.match(regexp) ); // witch, broom
    

An example of a regular expression using multiple operators and quantifiers

Suppose you have a document that contains a date in the following format: Friday, June 18, 2021 8:45:30 PM

To detect the month, the following regular expression can be used:
(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)

To detect the time, the following regular expression can be used:
([0-1]?[0-9]|[2][0-3]):([0-5][0-9])(:[0-5][0-9])?
This regular expression will find hours, minutes, and seconds (if indicated).

05.09.2024 16:23:54

Usage of Cookies. In order to optimize the website functionality and improve your online experience ABBYY uses cookies. You agree to the usage of cookies when you continue using this site. Further details can be found in our Privacy Notice.