Working with Dictionaries
ABBYY FineReader Engine allows you to attach dictionaries of various types to a recognition language, which can significantly improve recognition quality.
Dictionary types
Dictionaries may be of several types:
- Standard dictionary
- User dictionary
- Regular-expression-based dictionary
- External dictionary
ABBYY FineReader Engine provides a DictionaryDescription object for describing all types of dictionaries. This is the basic object from which the descriptions of different dictionary types are inherited.
All these dictionary descriptions are elements of the DictionaryDescriptions collection.
Creating a dictionary description
To create a dictionary description, use the AddNew method of the DictionaryDescriptions object. This method returns a reference to a DictionaryDescription object. To obtain a reference to the object describing the specific dictionary type, use the GetAsStandardDictionaryDescription, GetAsUserDictionaryDescription, GetAsRegExpDictionaryDescription, or GetAsExternalDictionaryDescription method of the DictionaryDescription object.
Dictionary properties
For each dictionary, the identification property of the dictionary must be specified:
- For a standard dictionary (StandardDictionaryDescription), specify its LanguageId property, which defines the ID of the language.
- For a user dictionary (UserDictionaryDescription), specify its FileName property, which provides the path to the user dictionary.
- For a regular-expression-based dictionary (RegExpDictionaryDescription), use the SetText method to specify the regular expression. See Working with ABBYY FineReader Engine Regular Expressions.
- For an external dictionary (ExternalDictionaryDescription), use the SetDictionary method to specify the dictionary.
Every dictionary, regardless of type, is assigned a weight. The weight of a dictionary affects the weight of words from that dictionary when they are detected during recognition. The weight is a percentage and must be non-negative. A weight of 0 does not automatically mean that there is no such dictionary. Weights above 100 percent are allowed, but use such values with great care. The weight is specified in the IDictionaryDescription::Weight property and is set to 100 by default.
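The relationship between a description, its identification property, and its weight can be pictured with a small stand-in model. This is a minimal sketch in Python, not the Engine API: the classes, the "kind" argument of add_new, the single "identification" field, and the file name are hypothetical, and only the roles of AddNew and Weight and the weight rules (non-negative, 100 by default, values above 100 allowed) come from this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DictionaryDescription:
    """Stand-in for a dictionary description; not the real Engine object."""
    kind: str                              # "standard", "user", "regexp" or "external"
    identification: Optional[str] = None   # LanguageId, FileName, regexp text, ...
    weight: int = 100                      # percentage; 100 is the documented default

    def set_weight(self, percent: int) -> None:
        # The only documented restriction is that the weight is non-negative;
        # values above 100 are accepted.
        if percent < 0:
            raise ValueError("weight must be non-negative")
        self.weight = percent

    def is_identified(self) -> bool:
        # A description is usable only once its identification is set.
        return bool(self.identification)


@dataclass
class DictionaryDescriptions:
    """Stand-in for the collection that owns the descriptions."""
    items: List[DictionaryDescription] = field(default_factory=list)

    def add_new(self, kind: str) -> DictionaryDescription:
        # Plays the role of AddNew: create, store, and return a description.
        description = DictionaryDescription(kind)
        self.items.append(description)
        return description


descriptions = DictionaryDescriptions()
user_dict = descriptions.add_new("user")
user_dict.identification = "streets.amd"   # hypothetical user dictionary path
user_dict.set_weight(150)                  # allowed, but use with care
```

In the real API the identification is set through a type-specific property or method (LanguageId, FileName, SetText, SetDictionary) rather than through one shared field.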
Standard dictionaries also have a CanUseTrigrams option which allows or forbids the program to use trigrams built on the basis of the selected dictionary. Trigrams are combinations of three letters. Not all of these combinations occur in real words. A word with a non-dictionary trigram is very likely to be unpronounceable. Trigrams are used to cut off unreliable words. We recommend enabling trigrams for "general" standard dictionaries and disabling them for dictionaries of terms.
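The trigram idea can be illustrated independently of the Engine. The sketch below is an assumption about how such a check might look, not a description of the Engine's internal algorithm: it collects every three-letter window from a word list and treats a candidate as suspicious if it contains a window that never occurs in that list. The function names and the sample words are made up for illustration.

```python
def trigrams(word: str) -> set:
    """Return the set of three-letter windows in a lowercased word."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}


def build_trigram_set(dictionary_words) -> set:
    """Collect every trigram that occurs in at least one dictionary word."""
    known = set()
    for word in dictionary_words:
        known |= trigrams(word)
    return known


def looks_plausible(candidate: str, known_trigrams: set) -> bool:
    """True when every trigram of the candidate occurs in the dictionary.

    Words shorter than three letters have no trigrams and always pass.
    """
    return trigrams(candidate) <= known_trigrams


known = build_trigram_set(["street", "stream", "treat", "meet"])
plausible = looks_plausible("streat", known)     # built only from known trigrams
implausible = looks_plausible("stxeet", known)   # "stx" never occurs in the list
```

Note that "streat" passes even though it is not a dictionary word: a trigram check can only cut off unlikely letter sequences, it does not confirm that a word exists.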
Dictionaries of a recognition language
A text recognition language (the TextLanguage object) can have both dictionaries containing words of the language and dictionaries of prohibited words. The former are specified for each base recognition language of the text language and are accessible via the IBaseLanguage::DictionaryDescriptions property. A base language may have no dictionary attached to it. Prohibiting dictionaries are attached directly to the text recognition language through the ITextLanguage::ProhibitingDictionaries property.
If you want only the dictionary words to be allowed during recognition, set the IBaseLanguage::AllowWordsFromDictionaryOnly property to TRUE. In this case, a word that is not found in the dictionary of the base language can appear in the recognized text only if ABBYY FineReader Engine found no dictionary variants.
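The rule for AllowWordsFromDictionaryOnly can be sketched as a selection over competing hypotheses for one word. This is only an illustration of the rule as stated above: the (word, confidence) pairs, the choose_variant function, and picking the maximum confidence are assumptions, since the Engine's actual scoring is not described in this section.

```python
def choose_variant(variants, dictionary, dictionary_only: bool) -> str:
    """Pick one word from a list of (word, confidence) hypotheses.

    With dictionary_only set, non-dictionary hypotheses are considered
    only when no hypothesis is found in the dictionary.
    """
    if not variants:
        raise ValueError("at least one variant is required")
    if dictionary_only:
        in_dictionary = [v for v in variants if v[0].lower() in dictionary]
        if in_dictionary:
            variants = in_dictionary
    # Highest confidence wins among the remaining hypotheses.
    return max(variants, key=lambda v: v[1])[0]


dictionary = {"main", "maple"}
hypotheses = [("rnain", 0.9), ("main", 0.8)]

strict = choose_variant(hypotheses, dictionary, dictionary_only=True)
relaxed = choose_variant(hypotheses, dictionary, dictionary_only=False)
fallback = choose_variant([("xq7", 0.5)], dictionary, dictionary_only=True)
```

Here strict selects the dictionary word even though a non-dictionary hypothesis scored higher, while fallback returns the non-dictionary word because no dictionary variant exists.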
How to attach a dictionary to a recognition language
- Create a TextLanguage object using one of the available methods (e.g., the CreateTextLanguage method of the LanguageDatabase object).
- Obtain the collection of base languages of the new text language (use the BaseLanguages property).
- Create a new BaseLanguage object and add it to the collection of base languages.
- Obtain the collection of dictionary descriptions of the new base language (the DictionaryDescriptions property).
- Create a dictionary description and add it to the collection of dictionary descriptions of the base language. Use the AddNew method of the DictionaryDescriptions collection.
Note: You can create several dictionaries of different types and add them to the DictionaryDescriptions collection of one base recognition language.
- [Optional] Specify the weight of the created dictionary.
- Specify the identification property of the dictionary: set the LanguageId property for a standard dictionary, set the FileName property for a user dictionary, call the IRegExpDictionaryDescription::SetText method for a regular-expression-based dictionary, or call the IExternalDictionaryDescription::SetDictionary method for an external dictionary.
- [Optional] Specify other properties of the BaseLanguage object.
- [Optional] Set the prohibiting dictionaries using the ProhibitingDictionaries property of the TextLanguage object.
- Assign the created TextLanguage object to the TextLanguage property of the RecognizerParams object.
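The order of the steps above can be traced with plain stand-in objects. The sketch below is not Engine code: the Python dataclasses only borrow the object and property names from this section, build_params is a hypothetical helper, and the real objects are created through Engine methods such as CreateTextLanguage and AddNew rather than by direct construction.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DictionaryDescription:
    kind: str                          # e.g. "user"
    weight: int = 100                  # documented default
    file_name: Optional[str] = None    # identification for a user dictionary


@dataclass
class BaseLanguage:
    dictionary_descriptions: List[DictionaryDescription] = field(default_factory=list)


@dataclass
class TextLanguage:
    base_languages: List[BaseLanguage] = field(default_factory=list)
    prohibiting_dictionaries: List[DictionaryDescription] = field(default_factory=list)


@dataclass
class RecognizerParams:
    text_language: Optional[TextLanguage] = None


def build_params(user_dictionary_path: str, weight: int = 100) -> RecognizerParams:
    # Create the text language and give it one base language.
    text_language = TextLanguage()
    base_language = BaseLanguage()
    text_language.base_languages.append(base_language)

    # Create a dictionary description in the base language's collection.
    description = DictionaryDescription(kind="user")
    base_language.dictionary_descriptions.append(description)

    # Optional weight, then the mandatory identification property.
    description.weight = weight
    description.file_name = user_dictionary_path

    # Hand the finished text language to the recognizer parameters.
    params = RecognizerParams()
    params.text_language = text_language
    return params


params = build_params("streets.amd")   # hypothetical dictionary path
```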
Cache dictionary
A cache dictionary is a small dictionary (about a hundred words) that can be changed easily during processing. Cache dictionaries can be used when it is possible to select a dictionary more precisely, e.g., if you find new information about the document during processing. Such dictionaries are suitable for field-level recognition.
For example, suppose there are two fields on a form you need to recognize: the name of a city and the name of a street. You have recognized the name of the city, and you have the list of streets in this city. In this case, you may load the appropriate cache dictionary with the street names and thus recognize the name of the street more quickly and accurately.
ABBYY FineReader Engine provides the AddWordsToCacheDictionary, AddWordToCacheDictionary, and CleanCacheDictionary methods of the FRPage object for working with cache dictionaries.
Important! To use the cache dictionary, set the IEngine::AutoCleanRecognizerSession property to FALSE. This property is TRUE by default, which means that FineReader Engine cleans its recognition session after recognizing each page, and the cache dictionary is cleaned along with it. To prevent accidental destruction of user data, FineReader Engine prohibits the use of cache dictionaries in this mode. If you use the cache dictionary, it is your responsibility to clean the recognition session manually by calling the IEngine::CleanRecognizerSession method when necessary. See the description of the method to find out when the recognition session must be cleaned.
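The city and street scenario, together with the AutoCleanRecognizerSession requirement, can be sketched with stub objects. Everything in this Python example is a stand-in: StubEngine, StubPage, the street table, and load_streets_for are hypothetical, the real methods may take different arguments, and raising an error is only an assumption about how the prohibition might surface.

```python
class StubEngine:
    """Stand-in for the Engine object; only models the auto-clean flag."""

    def __init__(self):
        self.auto_clean_recognizer_session = True   # documented default is TRUE


class StubPage:
    """Stand-in for FRPage with a cache dictionary held as a set of words."""

    def __init__(self, engine: StubEngine):
        self.engine = engine
        self.cache_dictionary = set()

    def add_words_to_cache_dictionary(self, words) -> None:
        # Assumption: misuse is reported as an error in auto-clean mode.
        if self.engine.auto_clean_recognizer_session:
            raise RuntimeError("set AutoCleanRecognizerSession to FALSE first")
        self.cache_dictionary.update(words)

    def clean_cache_dictionary(self) -> None:
        self.cache_dictionary.clear()


# Hypothetical lookup table: streets known for each city.
STREETS_BY_CITY = {
    "Springfield": ["Main Street", "Oak Avenue"],
    "Riverton": ["Mill Road"],
}


def load_streets_for(page: StubPage, city: str) -> None:
    """Replace the page's cache dictionary with the streets of one city."""
    page.clean_cache_dictionary()
    page.add_words_to_cache_dictionary(STREETS_BY_CITY.get(city, []))


engine = StubEngine()
engine.auto_clean_recognizer_session = False   # required before using the cache
page = StubPage(engine)
load_streets_for(page, "Springfield")          # after the city field is recognized
```

Cleaning the cache before loading keeps street names from a previously processed city out of the next recognition pass.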
7/3/2024 8:50:10 AM