When to use extraction scripts

Extraction results can sometimes be improved by using extraction scripts alongside an NLP model. You may want to use extraction scripts if:

  • You need to extract entities from a table.
  • You do not have enough sample documents to train your NLP model.
  • You are not satisfied with the quality of extraction on some of the fields.

Extraction scripts allow you to:

  • Identify text spans that match:
    • certain regular expressions
    • certain words or phrases from user dictionaries occurring in any inflected form in the text
    • any of the built-in NER objects:
      • People (NerPerson)
      • Organizations (NerOrg)
      • Locations (NerGeo)
      • Addresses (NerAddress)
      • Amounts of money (NerMoney)
      • Dates (NerDate)
      • Durations (NerDuration, available only for Russian and English texts)
      • Account numbers (NERAccountNumber, available only for Russian texts)
        Note: The NerMoney, NerDate, NerDuration and NERAccountNumber objects are used only in extraction scripts.
  • Run queries on text and text spans where search words and phrases may occur in any inflected form.
  • Save any identified text spans into document fields.
  • Extract addresses and the following address components from documents:
    • ZIP code (NerZipCode)
    • Country (NerCountry)
    • State (NerState)
    • City (NerCity)
    • Street (NerStreet)
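
The first capability in the list above (finding text spans that match a regular expression and saving them into a field) can be illustrated outside the product. The following is a minimal Python sketch, not the extraction script language itself; the invoice number pattern, the find_spans helper, and the InvoiceNumber field name are all hypothetical.

```python
import re

# Hypothetical pattern: an invoice number such as "INV-2024-00173".
INVOICE_RE = re.compile(r"\bINV-\d{4}-\d{5}\b")


def find_spans(text, pattern):
    """Return (start, end, matched_text) for every non-overlapping match."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


text = "Please pay INV-2024-00173 by 1 May. Ref: INV-2023-99812."
spans = find_spans(text, INVOICE_RE)

# Save the matched spans into a (hypothetical) document field.
fields = {"InvoiceNumber": [matched for _, _, matched in spans]}
```

Keeping the start and end offsets alongside the matched text makes it possible to run further searches inside a span, which is the same idea as querying text spans in an extraction script.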

To create an extraction script or select user dictionaries to be used, follow these steps:

  1. Open the Document Definition editor.
  2. Select a document section, right-click it, and click Properties… on the shortcut menu.
  3. Click the NLP tab.
  4. Under Extraction Scripts, click the Create… button.
  5. In the Extraction Script dialog box, do one of the following:
    • Click the Load… button to load a user dictionary.
    • Click the Edit… button to open the script editor.

Note: User dictionaries must be encoded either in UTF-8 with BOM or in ANSI.
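
If you generate a dictionary file programmatically, the encoding requirement is easy to get wrong, because many tools write UTF-8 without a BOM by default. The Python sketch below writes a file in UTF-8 with BOM and checks for the BOM afterwards. The one-entry-per-line layout, the file name, and the helper names are assumptions made for illustration; they are not taken from the product documentation.

```python
import codecs
import tempfile
from pathlib import Path


def save_dictionary_utf8_bom(entries, path):
    """Write entries one per line (assumed layout) as UTF-8 with BOM."""
    # The "utf-8-sig" codec makes Python write the BOM bytes EF BB BF first.
    Path(path).write_text("\n".join(entries) + "\n", encoding="utf-8-sig")


def starts_with_utf8_bom(path):
    """Return True if the file begins with the UTF-8 byte order mark."""
    return Path(path).read_bytes().startswith(codecs.BOM_UTF8)


entries = ["Müller GmbH", "Acme Corporation"]  # hypothetical dictionary entries
with tempfile.TemporaryDirectory() as tmp:
    dictionary_path = Path(tmp) / "vendors.txt"
    save_dictionary_utf8_bom(entries, dictionary_path)
    has_bom = starts_with_utf8_bom(dictionary_path)
    # Reading with "utf-8-sig" strips the BOM again.
    round_trip = dictionary_path.read_text(encoding="utf-8-sig").splitlines()
```

Running a check like starts_with_utf8_bom on an existing file before loading it is a quick way to rule out an encoding problem.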

Extracting address components from a document

To extract address components, do the following:

  1. Specify the area of the document that contains the address.

    We recommend that you restrict the search area with a FlexiLayout field and then use that area as a source for an extraction script. For more information, see Search constraints.

    An address may only contain one instance of each of the following components: ZIP code, country, state, city, and street. However, an extraction script may return several instances of a component. The more precisely you define the search area for an address, the fewer instances will be returned.
  2. Apply the appropriate extraction script.
    You can search for address components in the entire field or in a part of the field.
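
The effect of the search area on the number of returned instances can be sketched as follows. This is a Python illustration with made-up data, not extraction script code: the candidate tuples, the character offsets, and both helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical candidates found in a whole page: (component, start, end, text).
candidates = [
    ("NerCity", 12, 18, "Boston"),
    ("NerZipCode", 23, 28, "02108"),
    ("NerCity", 140, 147, "Chicago"),
    ("NerZipCode", 152, 157, "60601"),
]


def within(cands, area_start, area_end):
    """Keep only candidates that lie completely inside the search area."""
    return [c for c in cands if c[1] >= area_start and c[2] <= area_end]


def ambiguous_components(cands):
    """Return component types that occur more than once."""
    counts = Counter(component for component, _, _, _ in cands)
    return sorted(name for name, n in counts.items() if n > 1)


whole_page = ambiguous_components(candidates)
restricted = ambiguous_components(within(candidates, 0, 60))
```

With the whole page as the search area, both the city and the ZIP code are ambiguous. Restricting the area to the first address block leaves exactly one instance of each component.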

You can parse an address with the ParseAddressInPosition( resultCollectionNamePrefix : string, startPos : int, endPos : int ) and ParseAddressInSpan( resultCollectionNamePrefix : string, span : IInterval ) methods. During indexing, each word in a detected component receives the following attributes, which can then be used in XML queries:

  1. The name of the collection in the format [resultCollectionNamePrefix]_[NerTypeOfComponent].
  2. The resultCollectionNamePrefix prefix. 
  3. The type of the NER object.
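
The naming scheme for the result collections can be sketched in Python as shown below. The sketch assumes that the square brackets in the format are placeholders rather than literal characters. The "Vendor" prefix and the helper names are hypothetical, and the code only builds strings; it does not call the product.

```python
# Address component types listed earlier in this article.
ADDRESS_COMPONENT_TYPES = [
    "NerZipCode",
    "NerCountry",
    "NerState",
    "NerCity",
    "NerStreet",
]


def collection_names(prefix):
    """Map each component type to a name of the form prefix_NerType."""
    return {ner_type: f"{prefix}_{ner_type}" for ner_type in ADDRESS_COMPONENT_TYPES}


def word_attributes(prefix, ner_type):
    """Return the three attributes a word in a detected component receives."""
    return (f"{prefix}_{ner_type}", prefix, ner_type)


names = collection_names("Vendor")  # "Vendor" is a hypothetical prefix
city_attributes = word_attributes("Vendor", "NerCity")
```

Using a distinct prefix for each parsed address (for example, one for the vendor and one for the buyer) keeps the resulting collections apart when several addresses are parsed in the same document.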

See below for a sample XML address extraction query.

Note: Currently, address components can be extracted only from German and US addresses.

4/12/2024 6:16:02 PM
