Query language
A query specifies what words or combinations of words a document must contain. The query language used by extraction scripts is based on XML.
A simple query is built from the following elements:
- <Request> — This is the root element of a query.
- <Query> — This is a basic query. A basic query contains a query tree that specifies certain conditions to be met by the combination of words that you are looking for. A basic query may be part of a combination of word forms in a higher-level query. This is a required element.
- <Contain> — Specifies what words or combinations of words a document must contain. A query may have only one <Contain> element. This is a required element. This element may include any number of the following elements, in any order:
- <Required> — Specifies a word form or subquery that a word combination must contain. This is an optional element.
- <Optional> — Specifies a word form or subquery that a word combination may contain. This is an optional element.
- <Except> — Specifies a word form or subquery that a word combination must not contain. This is an optional element.
Each of the above elements must contain one of the following elements:
- <Query> — Specifies whether a string of word forms must be included in or excluded from the word combination.
- <Form> — Specifies whether a word form must be included in or excluded from a word combination.
Note: The <Contain> element must contain at least one <Required> or <Optional> element.
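As a rough illustration of the element structure described above, the following sketch builds a minimal query as a script string. The words "first" and "second" are placeholders chosen for this example, and the <Text> element inside <Form> follows the description of the <Form> element later in this section.

```javascript
// Minimal query sketch: one required and one optional word form.
// "first" and "second" are placeholder words, not values taken from the product.
var xmlQuery = [
  '<Request>',
  '  <Query>',
  '    <Contain>',
  '      <Required>',
  '        <Form><Text>first</Text></Form>',
  '      </Required>',
  '      <Optional>',
  '        <Form><Text>second</Text></Form>',
  '      </Optional>',
  '    </Contain>',
  '  </Query>',
  '</Request>'
].join('\n');
```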
Attributes of the <Contain> tag
Additional restrictions on a combination of word forms can be specified in the attributes of the <Contain> tag.
MinCount attribute for the minimum number of items in a combination of word forms
The minimum number of items in a combination of word forms can be specified in a MinCount attribute. Each item is either a word form or a string of word forms returned by the subquery. The default value of the attribute is 1. The value of this parameter must not be greater than the total number of <Required> and <Optional> elements in the given combination of word forms.
This parameter is useful when a combination of word forms contains one or more <Optional> elements.
First, here is an example of a query that looks for either the first or the second parameter:
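A query of this kind could be sketched as follows, assuming two optional word forms and a MinCount of 1, so that either word is enough for a match. The words "first" and "second" are placeholders.

```javascript
// Sketch: either of two word forms satisfies the query.
// MinCount="1" is the documented default; it is written out here only for clarity.
var xmlQuery = [
  '<Request>',
  '  <Query>',
  '    <Contain MinCount="1">',
  '      <Optional><Form><Text>first</Text></Form></Optional>',
  '      <Optional><Form><Text>second</Text></Form></Optional>',
  '    </Contain>',
  '  </Query>',
  '</Request>'
].join('\n');
```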
The following query tells the program to find at least three of the four words specified in the query, of which two words are required:
The query above will find the following phrases: “US President Barack Obama,” “President Barack Obama,” and “US President Obama.”
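A query consistent with this description might look like the sketch below. It assumes that "President" and "Obama" are the two required words, since they appear in all three phrases, and that "US" and "Barack" are optional.

```javascript
// Sketch: at least three of four words must be present, two of them required.
// The split into required and optional words is inferred from the phrases listed above.
var xmlQuery = [
  '<Request>',
  '  <Query>',
  '    <Contain MinCount="3">',
  '      <Optional><Form><Text>US</Text></Form></Optional>',
  '      <Required><Form><Text>President</Text></Form></Required>',
  '      <Optional><Form><Text>Barack</Text></Form></Optional>',
  '      <Required><Form><Text>Obama</Text></Form></Required>',
  '    </Contain>',
  '  </Query>',
  '</Request>'
].join('\n');
```

Note that MinCount (3) does not exceed the total number of <Required> and <Optional> elements (4), as the attribute description requires.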
KeepOrder attribute for a fixed order of items in a combination of word forms
To specify that the order of items in a combination of word forms is fixed, you can use a KeepOrder attribute of type Boolean. The default value of the attribute is false.
Here is an example of a query that specifies that the order of items is fixed:
The combination “first third second third” will not match the query, even though the combination does contain a string of words arranged in the required order.
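A fixed-order query over the three words used in this example could be sketched as follows. The assumption here is that all three word forms are required and that no distance attributes are set.

```javascript
// Sketch: three required word forms that must occur in the order listed.
// KeepOrder uses the same "true" spelling as the address sample at the end of this section.
var xmlQuery = [
  '<Request>',
  '  <Query>',
  '    <Contain KeepOrder="true">',
  '      <Required><Form><Text>first</Text></Form></Required>',
  '      <Required><Form><Text>second</Text></Form></Required>',
  '      <Required><Form><Text>third</Text></Form></Required>',
  '    </Contain>',
  '  </Query>',
  '</Request>'
].join('\n');
```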
KeepOrder will also apply to any <Except> elements. Any words corresponding to the <Except> elements in a query must not occur between the words that they separate in the query; however, they may occur in any other positions beyond the fixed-order query strings.
For example, if we modify the above query by putting the word form “second” into an <Except> element instead of <Required>, then a document containing the combination of word forms “first third second” will match the query (but the word form “second” will not be included in the result). If we further modify the above query by removing the KeepOrder attribute, a document containing the combination of word forms “first third second” will not be included in the result because the word form “second” must not occur anywhere in the text of a document.
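The modification described above, with “second” moved into an <Except> element, might be written as in this sketch. It assumes the other two word forms stay required and keep their positions.

```javascript
// Sketch: "second" must not occur between "first" and "third" when the order is fixed.
var xmlQuery = [
  '<Request>',
  '  <Query>',
  '    <Contain KeepOrder="true">',
  '      <Required><Form><Text>first</Text></Form></Required>',
  '      <Except><Form><Text>second</Text></Form></Except>',
  '      <Required><Form><Text>third</Text></Form></Required>',
  '    </Contain>',
  '  </Query>',
  '</Request>'
].join('\n');
```

Dropping KeepOrder="true" from the <Contain> tag gives the second variant discussed above, in which “second” must not occur anywhere in the text.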
MinDistance and MaxDistance attributes for specifying distances
The MinDistance and MaxDistance attributes specify the minimum and maximum distances between words in a query. These attributes have no default values. If either of the two attributes is not specified, no distance limitations will apply.
Distance between words is measured in words and is calculated as the difference in the positions of the two corresponding words. The distance between two neighboring words equals 1 and so the minimum value of either attribute is 1. MaxDistance must be greater than or equal to MinDistance.
The distance between two strings of words is calculated as follows. If the strings do not overlap, then the distance is calculated as the difference between the position of the left-most word in the right string and the position of the right-most word in the left string. If the strings overlap, then the distance is assumed to be 0.
For example, in the phrase “The quick brown fox jumped over the lazy dog,” the distance between the strings “quick fox” and “lazy dog” is 4 and the distance between the strings “quick fox” and “brown lazy dog” is 0.
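The distance rule can be sketched as a small function that takes the word positions of two strings. This is only an illustration of the calculation described above, not the program's own code; it assumes 1-based word positions and treats two strings as overlapping when their position ranges intersect.

```javascript
// Distance between two strings of words, given the positions of their words in the text.
// Assumption: strings "overlap" when the ranges [min, max] of their positions intersect.
function distanceBetweenStrings(positionsA, positionsB) {
  var minA = Math.min.apply(null, positionsA);
  var maxA = Math.max.apply(null, positionsA);
  var minB = Math.min.apply(null, positionsB);
  var maxB = Math.max.apply(null, positionsB);
  if (maxA < minB) { return minB - maxA; } // A lies entirely to the left of B
  if (maxB < minA) { return minA - maxB; } // B lies entirely to the left of A
  return 0;                                // the ranges intersect
}

// "The quick brown fox jumped over the lazy dog": The=1, quick=2, brown=3, fox=4, ..., lazy=8, dog=9
var quickFox = [2, 4];
var lazyDog = [8, 9];
var brownLazyDog = [3, 8, 9];
console.log(distanceBetweenStrings(quickFox, lazyDog));      // 4
console.log(distanceBetweenStrings(quickFox, brownLazyDog)); // 0
```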
For <Except> elements, the distance between words is calculated as follows:
- If KeepOrder="true", the word must not occur within the specified distance of the neighboring words in the string (i.e. those words between which it occurs in the query). At the same time, the distance between the neighbors of the <Except> element must be within the specified range.
- If KeepOrder="false", then the word must not occur within the specified distance of any other word in the string.
Example 1:
This query will find such phrases as “sodium carborate decahydrate” and “sodium sulfate decahydrate.”
Example 2:
This query will only find “sodium sulfate decahydrate,” because the word "tetraborate" is placed in the <Except></Except> tags and the maximum distance between "tetraborate" and “sodium” is two words.
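A query along the lines of Example 2 could be sketched as follows. The values MaxDistance="2" and KeepOrder="true" are assumptions based on the explanation above and are not taken from the original example.

```javascript
// Sketch: "sodium ... decahydrate" with "tetraborate" excluded between them.
// MaxDistance and KeepOrder values are assumed for illustration.
var xmlQuery = [
  '<Request>',
  '  <Query>',
  '    <Contain MaxDistance="2" KeepOrder="true">',
  '      <Required><Form><Text>sodium</Text></Form></Required>',
  '      <Except><Form><Text>tetraborate</Text></Form></Except>',
  '      <Required><Form><Text>decahydrate</Text></Form></Required>',
  '    </Contain>',
  '  </Query>',
  '</Request>'
].join('\n');
```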
<Form> element for getting a word form
A query for one word form is specified using a <Form> element. This element may contain the following elements:
- <Attributes> — An optional element that contains a query for the attributes of a word form.
- <Text> — An optional element that contains the Unicode text of a word form.
If no word form text is specified, any word that matches the attributes query will match the query; the attributes query is required in this case.
Additional search conditions for a word form can be specified in the attributes of the <Form> tag.
SearchType attribute for specifying the type of word form search
The type of word form search can be specified in the SearchType attribute of the <Form> tag. This attribute may have the following values:
- AllFormsSearch – The program will look for all the forms of the specified word.
- ExactSearch – The program will look only for the specified form of the word.
- PrefixSearch – The program will look for any word forms prefixed by the specified string.
- FuzzySearch – The program will perform a fuzzy search for the specified word. FuzzySearch may be useful if you have reasons to believe that your texts may contain OCR errors and ExactSearch will not work. FuzzySearch can only be used for words containing at least 3 characters. For words between 3 and 5 characters in length, FuzzySearch allows 1 OCR error. For words longer than 5 characters, FuzzySearch allows up to 2 OCR errors.
- FuzzyPrefixSearch – The program will perform a fuzzy search for any words prefixed by the specified string.
The SearchType attribute is optional and is set to AllFormsSearch by default.
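The FuzzySearch length thresholds listed above can be summarized in a small helper. The function name is hypothetical and the code only restates the documented limits; it assumes that word length is counted in characters as reported by the script engine.

```javascript
// Returns the number of OCR errors FuzzySearch tolerates for a word,
// or null when the word is too short for FuzzySearch (fewer than 3 characters).
function allowedOcrErrors(word) {
  if (word.length < 3) { return null; } // FuzzySearch cannot be used
  if (word.length <= 5) { return 1; }   // 3 to 5 characters
  return 2;                             // longer than 5 characters
}

console.log(allowedOcrErrors("of"));     // null
console.log(allowedOcrErrors("fox"));    // 1
console.log(allowedOcrErrors("sodium")); // 2
```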
CaseSensitive attribute for case-sensitive searches
For case-sensitive searches, the CaseSensitive attribute of the <Form> tag can be used. This attribute is optional and is set to false by default.
Here is a query which illustrates the use of attributes of the <Form> tag:
This query will look for the acronym WHO in exactly this form, which helps avoid a large number of redundant results containing “who,” “whom,” or “whose.”
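A query with this behavior could be sketched as follows, assuming that exact matching is requested with SearchType="ExactSearch" and that CaseSensitive takes the same "true" spelling as the KeepOrder attribute.

```javascript
// Sketch: find the acronym WHO exactly as written, with case-sensitive matching.
var xmlQuery = [
  '<Request>',
  '  <Query>',
  '    <Contain>',
  '      <Required>',
  '        <Form SearchType="ExactSearch" CaseSensitive="true"><Text>WHO</Text></Form>',
  '      </Required>',
  '    </Contain>',
  '  </Query>',
  '</Request>'
].join('\n');
```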
<Attributes> element for getting word form attributes
A query for word form attributes is a logical expression built with AND, OR, and NOT operators. NOT is a unary operator, whereas AND and OR are n-ary. The operands of this logical expression are values of type Bool. This logical expression is written in the form of a tree. The result of the query will be the word forms in the text of a document that satisfy this logical expression.
To get word form attributes, use an <Attributes> element. This element may contain the following elements:
- <Attribute> — This element contains the text of the required word attribute. It is a leaf on the logical expression tree.
- <Or> — The OR operator is a node in the tree.
- <And> — The AND operator is a node in the tree.
- <Not> — The NOT operator is a node in the tree.
The <Not> element is constructed in the same manner as the <Attributes> element and may contain only one of the above elements.
The <Or> and <And> elements must contain at least two of the above elements.
The <Attribute> tag has an optional SearchType attribute, which specifies the type of attribute search. This attribute may have the following values:
- ExactSearch — The program will look for the attribute in exactly the same form as specified in the query.
- PrefixSearch — The program will look for any attributes prefixed by the specified text.
Search for attributes is always case-sensitive. The SearchType attribute is set to ExactSearch by default.
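To illustrate how such a logical expression tree is evaluated, the following sketch checks a tree against the attributes indexed for a single word form. The object shapes and the function name are this example's own convention and not part of the product; the attribute names are borrowed from the address sample at the end of this section.

```javascript
// Node shapes used by this sketch (hypothetical, for illustration only):
//   { attribute: "text", searchType: "ExactSearch" | "PrefixSearch" }  - leaf
//   { and: [nodes] }, { or: [nodes] }, { not: node }                   - operators
function matchesAttributes(node, wordAttributes) {
  if (node.and) {
    return node.and.every(function (child) { return matchesAttributes(child, wordAttributes); });
  }
  if (node.or) {
    return node.or.some(function (child) { return matchesAttributes(child, wordAttributes); });
  }
  if (node.not) {
    return !matchesAttributes(node.not, wordAttributes);
  }
  // Leaf: case-sensitive comparison; ExactSearch is used unless PrefixSearch is requested.
  return wordAttributes.some(function (attr) {
    return node.searchType === "PrefixSearch"
      ? attr.indexOf(node.attribute) === 0
      : attr === node.attribute;
  });
}

// Any attribute starting with "Ner", except NerZipCode1.
var expression = {
  and: [
    { attribute: "Ner", searchType: "PrefixSearch" },
    { not: { attribute: "NerZipCode1" } }
  ]
};
console.log(matchesAttributes(expression, ["NerCity1"]));    // true
console.log(matchesAttributes(expression, ["NerZipCode1"])); // false
console.log(matchesAttributes(expression, ["regExp1"]));     // false
```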
Suppose we have a document where we have already identified:
- NER objects (by calling the ExtractNerObjects function)
- word forms from a user dictionary named “dictionary” (by calling the ExtractWordsFromUserDictionary function)
- all objects that satisfy a regular expression that we have passed as a parameter (by calling the ExtractRegularExpression function)
Let’s also assume that the resulting collection of these objects is named “regExp”.
Note: The name of the collection can be used in XML queries performed on the indexed document. The resulting collection itself can be accessed by its name.
Then a form attribute query will look something like this:
This query will look for a two-word phrase, where the first word must match the specified regular expression and the second is an allowed name of an organization.
The digits after the attribute names serve to single out the required words in detected multi-word objects that have been indexed with the respective attributes. For example, a regular expression named “date” may find a date in the format “May 31, 2019.” Then date1 will correspond to the word “May,” date2 will correspond to “31,” and date3 will correspond to “2019.”
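A form attribute query of the kind described above might look like the following sketch. The collection names regExp and dictionary come from the text; the attribute name NerOrganization1 is hypothetical, chosen by analogy with the Ner attribute names in the address sample, and the KeepOrder and MaxDistance values are assumptions used to express a two-word phrase.

```javascript
// Sketch: first word matches the regular expression collection, second word is an
// organization name that is also listed in the user dictionary.
// NerOrganization1 is a hypothetical attribute name used only for illustration.
var xmlQuery = [
  '<Request>',
  '  <Query>',
  '    <Contain KeepOrder="true" MaxDistance="1">',
  '      <Required>',
  '        <Form><Attributes><Attribute>regExp1</Attribute></Attributes></Form>',
  '      </Required>',
  '      <Required>',
  '        <Form>',
  '          <Attributes>',
  '            <And>',
  '              <Attribute>NerOrganization1</Attribute>',
  '              <Attribute>dictionary1</Attribute>',
  '            </And>',
  '          </Attributes>',
  '        </Form>',
  '      </Required>',
  '    </Contain>',
  '  </Query>',
  '</Request>'
].join('\n');
```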
<FormSet> element for getting a set of word forms
A query for multiple word forms is specified using a <FormSet> element.
This type of query combines multiple one-form queries using OR. It is equivalent to a <Query> query where all subqueries are optional word form queries.
For a <FormSet> query, however, you can specify an attribute query common to all the forms. This will make for more efficient searches when ExactSearch is used to find all the word forms, there is an attribute query for each form, and all these attribute queries share some common fragment.
A <FormSet> element contains the following elements:
- <Attributes> — This is an optional element that contains a query for form attributes. This query is combined with form attribute queries using AND.
- <Form> — This is a required element that contains a query for a word form. A <FormSet> element must contain at least one <Form> element.
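As an illustration, a <FormSet> fragment with a shared attribute query could be sketched as follows. The city names are placeholders, the attribute name is borrowed from the address sample below, and the fragment is shown on its own because its placement within a larger query is not covered here.

```javascript
// Sketch: three exact word forms combined with OR, all sharing one attribute condition.
// "Paris", "London", and "Berlin" are placeholder words.
var formSetFragment = [
  '<FormSet>',
  '  <Attributes><Attribute>NerCity1</Attribute></Attributes>',
  '  <Form SearchType="ExactSearch"><Text>Paris</Text></Form>',
  '  <Form SearchType="ExactSearch"><Text>London</Text></Form>',
  '  <Form SearchType="ExactSearch"><Text>Berlin</Text></Form>',
  '</FormSet>'
].join('\n');
```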
Sample XML address extraction query
The result of an address extraction query will be a text string containing the first word of the country, the first word of the street, the first word of the city, the first word of the state, and the first word of the ZIP code (in that order and provided they are no more than 5 words apart from one another).
The consecutive words making up a component (e.g. a street name) are additionally numbered in the index, starting from 1.
// Address query: the ZIP code is required; country, street, city, and state are optional.
var xmlQuery = "<Request> \
<Query> \
<Contain MaxDistance=\"5\" KeepOrder=\"true\"> \
<Optional> \
<Form><Attributes><Attribute>NerCountry1</Attribute></Attributes></Form> \
</Optional> \
<Optional> \
<Form><Attributes><Attribute>NerStreet1</Attribute></Attributes></Form> \
</Optional> \
<Optional> \
<Form><Attributes><Attribute>NerCity1</Attribute></Attributes></Form> \
</Optional> \
<Optional> \
<Form><Attributes><Attribute>NerState1</Attribute></Attributes></Form> \
</Optional> \
<Required> \
<Form><Attributes><Attribute>NerZipCode1</Attribute></Attributes></Form> \
</Required> \
</Contain> \
</Query> \
</Request>";
The result of an address extraction query is recorded into a repeating xmlQueryResult field:
this.RunQueryAndSaveToField( xmlQuery, "query", "xmlQueryResult" );
12.04.2024 18:16:02