Extract substring
This operation extracts data from the current substring, i.e. from cells in the selected columns.
Suppose your loaded data contains a column with comments and you want to extract certain information from these comments. To do this, you need to place this information into a separate column, map the fields of the new column, and apply the required analysis modules.
Configuring the operation
- Add a new operation and select Extract substring from the list.
For detailed instructions on creating operations, see Repository & Data Management > ETL in the Cloud > Operations. - In the operation editor, specify the following:
- Source column
From the drop-down list, select the column from which you need to extract data. - New column
Enter a name for your new column. This column will contain the extracted data. - Strategy
Select a strategy to be used to extract data from the source column you specified in step 2а: - Position
This strategy extracts characters occupying specific positions in the substring.
For example, if Start position is set to 1 and Extract string length is set to 3, the program will extract characters occupying positions 1 through 3. - Pattern
This strategy uses a regular expression to extract data.
You will need to specify a regular expression in the Pattern expression field (see Using regular expressions below).
Click Show regexp hints to see help on regular expressions. - Use filter
You can use filters if you need to apply an operation only to records that meet certain criteria (for example, only to those records where the "Employee" field is not empty).
For details, see Using filters with operations. - Use the Run preview option if you want to view the result of the operation before performing it. Only part of the table will be displayed in preview mode.
- Click Save. This will close the operation editor window, with the created operation appearing on the top panel.
Using regular expressions
The Extract substring operation can use a regular expression to extract text from a column. Regular expressions allow you to specify complicated search patterns.
The tables below list the available operators and quantifiers and provide examples of their use.
Important.Text search with regular expressions is case-sensitive.
Note: https://regex101.com/ offers an online service that helps you create and parse regular expressions, explaining the purpose of each operator. There you can also test your regular expression and get detailed information on how it works. Simply enter your regular expression in the REGULAR EXPRESSION field and see it parsed and explained in the EXPLANATION pane.
Operators
Operator | Description | Example |
---|---|---|
String of characters | Matches the specified string of characters. |
Regular expression: Column contains job titles: Regular expression will find: |
String of characters in round brackets ( ) |
Matches the specified string of characters. The round brackets allow additional operations to be applied to the string. |
Regular expression: Column contains job titles: Regular expression will find: |
String of characters in round brackets (?: ) |
Matches the specified string of characters. The matched substring will not be saved in the array of results. |
Regular expression:
|
Any character . (period) |
Matches any character—a letter, a digit, or a special symbol. |
Regular expression: Column contains names of employees: Regular expression will find: |
Characters in square brackets [ ] |
Matches any one of the specified characters. The search will stop as soon as a matching character is found. To specify a range of possible characters, use “-”: |
Regular expression: Column contains text that mentions social security numbers: Regular expression will find: |
Special symbols \ |
Matches any one of the following special symbols: Use the backslash to escape a special symbol. |
Regular expression: Column contains records marked with: "+" Regular expression will find: |
Number of occurrences \{} |
Specifies the number of times the preceding expression may occur. |
Regular expression: Column contains text that mentions extension phone numbers: Regular expression will find: |
Alternatives (More than one) | |
Combines multiple alternatives and matches any of one of them. |
Regular expression: Column contains job titles: Regular expression will find: |
Quantifiers
A quantifier specifies the allowed number of occurrences for the immediately preceding character, group of characters or class of characters.
Quantifier | Description | Example |
---|---|---|
? Alternative {0,1} |
Repeats the preceding character zero or one times. Makes the preceding character optional. To make optional a group of characters, enclose the characters in round brackets. |
Regular expression: Column contains names of employees: Regular expression will find: |
* Alternative {0,} |
Repeats the preceding character zero or more times. |
Regular expression: Column contains names of employees, some of them with typos: Regular expression will find: |
+ Alternative {1,} |
Repeats the preceding character one or more times. |
Regular expression: Column contains text that mentions book chapters: Regular expression will find: |
{m} | Repeats the preceding pattern m times. |
Regular expression: Column contains text that mentions vehicle registration year: Regular expression will find: |
{m,} | Repeats the preceding pattern at least m times. |
Regular expression: Column contains text that mentions customers' phone numbers: Regular expression will find: |
{m, n} | Repeats the preceding pattern m to n times (m cannot be greater than n). |
Regular expression: Column contains text that mentions vehicle registration year: Regular expression will find: |
Quantifiers may be:
- greedy
- non-greedy (also called lazy or reluctant)
Greedy is the default behavior and means that the quantifier will match as much as it can, i.e. the regular expression will initially look for the maximum number of matching characters and then will give back one character at a time if the remainder of the pattern cannot be matched.
let regexp = /".+"/g; let str = 'a "witch" and her "broom" is one'; alert( str.match(regexp) ); // "witch" and her "broom"
Conversely, a non-greedy quantifier will attempt to match the minimum number of occurrences.
let regexp = /".+?"/g; let str = 'a "witch" and her "broom" is one'; alert( str.match(regexp) ); // witch, broom
An example of a regular expression using multiple operators and quantifiers
Suppose you have a document that contains a date in the following format: Friday, June 18, 2021 8:45:30 PM
To detect the month, the following regular expression can be used:
(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)
To detect the time, the following regular expression can be used:
([0-1]?[0-9]|[2][0-3]):([0-5][0-9])(:[0-5][0-9])?
This regular expression will find hours, minutes, and seconds (if indicated).
9/5/2024 4:23:54 PM