When to use extraction scripts
Extraction results can sometimes be improved by using extraction scripts alongside an NLP model. You may want to use extraction scripts if:
- You need to extract entities from a table.
- You do not have enough sample documents to train your NLP model.
- You are not satisfied with the quality of extraction on some of the fields.
Extraction scripts allow you to:
- Identify text spans that match:
  - certain regular expressions
  - certain words or phrases from user dictionaries occurring in any inflected form in the text
  - any of the built-in NER objects:
    - People (NerPerson)
    - Organizations (NerOrg)
    - Locations (NerGeo)
    - Addresses (NerAddress)
    - Amounts of money (NerMoney)
    - Dates (NerDate)
    - Duration (NerDuration, available only for Russian and English texts)
    - Account numbers (NERAccountNumber, available only for Russian texts)
  Note: The NerMoney, NerDate, NerDuration and NERAccountNumber objects are used only in extraction scripts.
- Run queries on text and text spans where search words and phrases may occur in any inflected form.
- Save any identified text spans into document fields.
- Extract addresses and the following address components from documents:
  - ZIP code (NerZipCode)
  - Country (NerCountry)
  - State (NerState)
  - City (NerCity)
  - Street (NerStreet)
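As a rough illustration of two of the capabilities listed above (matching text spans against a regular expression and saving the matches into a document field), here is a minimal Python sketch. It is not written in the extraction script language itself. The pattern, the sample text, the helper find_spans, and the field name InvoiceNumber are all assumptions made for the example.

```python
import re

# Hypothetical pattern: an invoice number such as "INV-20391".
INVOICE_NO = re.compile(r"\bINV-\d{4,}\b")


def find_spans(text, pattern):
    """Return (start, end, matched_text) for every match of pattern in text."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


text = "Please pay invoice INV-20391 by Friday. Ref: INV-20392."
spans = find_spans(text, INVOICE_NO)

# Save the identified spans into a (hypothetical) document field.
fields = {"InvoiceNumber": [matched for _, _, matched in spans]}
```

Each span keeps its start and end offsets, which is the kind of positional information a script needs when it later restricts a search to part of a field.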
To create an extraction script, do the following:
- Open the Document Definition editor.
- Select a document section, right-click it, and click Properties… on the shortcut menu.
- Click the NLP tab.
- Under Extraction Scripts, click Create….
- In the Extraction Script dialog box, do one of the following:
  - Click the Load… button to load a user dictionary, or
  - Click the Edit… button to open the script editor.
Note: User dictionaries should be encoded in either UTF-8 with BOM or ANSI.
Extracting address components from a document
To extract address components, do the following:
- Specify the area of the document that contains the address.
  We recommend that you restrict the search area with a FlexiLayout field and then use that area as a source for an extraction script. For more information, see Search constraints.
  An address may only contain one instance of each of the following components: ZIP code, country, state, city, and street. However, an extraction script may return several instances of a component. The more precisely you define the search area for an address, the fewer instances will be returned.
- Apply the appropriate extraction script.
  You can search for address components in the entire field or in a part of the field.
When you parse an address with the ParseAddressInPosition( resultCollectionNamePrefix : string, startPos : int, endPos : int ) or ParseAddressInSpan( resultCollectionNamePrefix : string, span : IInterval ) method, each word in the detected components receives the following attributes during indexing. You can then use these attributes in XML queries:
- The name of the collection in the format [resultCollectionNamePrefix]_[NerTypeOfComponent].
- The resultCollectionNamePrefix prefix.
- The type of the NER object.
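The naming scheme above, and the effect of narrowing the search area, can be modeled with a short Python sketch. This is only an illustration of the idea, not the product's API. The helper names, the tuple layout of the detected components, the "Addr" prefix, and the half-open treatment of the start and end positions are assumptions.

```python
from collections import defaultdict


def collection_name(prefix, ner_type):
    """Build a collection name in the [prefix]_[NerType] format."""
    return f"{prefix}_{ner_type}"


def group_components(prefix, components, start_pos, end_pos):
    """Group detected components that lie inside [start_pos, end_pos).

    components is an iterable of (ner_type, start, end, text) tuples;
    this input shape is an assumption made for the sketch.
    """
    collections = defaultdict(list)
    for ner_type, start, end, text in components:
        if start_pos <= start and end <= end_pos:
            collections[collection_name(prefix, ner_type)].append(text)
    return dict(collections)


# Hypothetical detections: two cities found in different parts of a page.
detected = [
    ("NerCity", 10, 16, "Berlin"),
    ("NerZipCode", 4, 9, "10115"),
    ("NerCity", 120, 127, "Hamburg"),
]

narrow = group_components("Addr", detected, 0, 40)   # tight search area
wide = group_components("Addr", detected, 0, 200)    # loose search area
```

With the tight search area, the Addr_NerCity collection holds a single city. With the loose one it holds two, which is the ambiguity that a precisely defined search area is meant to avoid.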
See below for a sample XML address extraction query.
Note: Currently, you can only extract components of German and US addresses.