Working with Dictionaries
ABBYY FineReader Engine allows you to attach dictionaries of various types to a recognition language, which greatly improves recognition quality.
Dictionaries may be of several types:
Standard dictionary
This type of dictionary is already provided for the predefined languages that have built-in dictionary support (see the comments in the list of predefined languages). Additionally, for some languages, there are dictionaries of specialized terms (e.g., medical and legal) packed in the .zmd archive. Standard dictionaries are represented by three or four files. Their names are usually the same as the full or short name of the language, with an .amd, .amm, .amt, or .ame extension. Files with the .amd, .amm, and .amt extensions are always present (they are stored in the Data\ExtendedDictionaries folder) and cannot be changed.
No .ame files are provided with ABBYY FineReader Engine: this is a format for storing a dictionary extension, i.e., words added to the dictionary by the user. You can create a dictionary extension in ABBYY FineReader, where it is called a user dictionary, and then copy the created file to the Data\ExtendedDictionaries folder in the ABBYY FineReader Engine folder (or you may specify the full path to it in the ILanguageDatabase::DictionaryExtensionsPath property). ABBYY FineReader stores the extensions of standard dictionaries in %appdata%\ABBYY\FineReader\15\FineReaderShell\UserDictionaries. Dictionary extensions can be edited in ABBYY FineReader Engine using the ILanguageDatabase::OpenDictionaryExtension method.
This dictionary type is described by the StandardDictionaryDescription object.
User dictionary
A user dictionary can be created using the Dictionary object. The Dictionary object allows you to add and remove words using its methods and to edit the dictionary with the help of the Dictionary dialog box. This dialog box allows you to import any text file in Windows ANSI or Unicode encoding (the only requirement is that words must be separated by spaces or other non-alphabetic characters).
This dictionary type is described by the UserDictionaryDescription object.
Note: The user dictionary in FineReader Engine has the .amd file format and can be created for any language. It can take the place of the standard dictionary for languages that do not have dictionary support. The user dictionary of ABBYY FineReader is called a dictionary extension in FineReader Engine; it is an .ame file and can be created only for languages with dictionary support, as an extension of the standard dictionary for that language.
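The separator requirement for imported text files can be illustrated with a small standalone sketch. It is not part of the FineReader Engine API and is not how the Dictionary dialog box is implemented; the function name splitIntoWords is hypothetical, and the sketch assumes that only ASCII letters count as alphabetic, which is a simplification.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper, not an Engine call: splits raw text into words,
// treating every character that is not an ASCII letter as a separator.
// Assumption: letters outside ASCII are ignored here for simplicity.
std::vector<std::string> splitIntoWords(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalpha(ch)) {
            current.push_back(static_cast<char>(ch));
        } else if (!current.empty()) {
            // A separator ends the word collected so far.
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        words.push_back(current);  // last word without a trailing separator
    }
    return words;
}
```

A file prepared this way contains one clearly delimited word per token, so digits, punctuation, and line breaks between words do not matter.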
Regular-expression-based dictionary
This type of dictionary specifies the rules that define which words are allowed in a language and which are not.
This dictionary type is described by the RegExpDictionaryDescription object.
External dictionary
This type of dictionary allows you to implement your own type of dictionary. It is represented by the IExternalDictionary interface, which is implemented on the client side. You can find guidelines for creating an external dictionary in the description of this interface.
This dictionary type is described by the ExternalDictionaryDescription object.
ABBYY FineReader Engine provides a DictionaryDescription object for describing all types of dictionaries. This is the basic object from which the descriptions of different dictionary types are inherited.
All these dictionary descriptions are elements of the DictionaryDescriptions collection.
Creating a dictionary description
To create a dictionary description, use the AddNew method of the DictionaryDescriptions object. This method returns a reference to the DictionaryDescription object. To obtain a reference to the object describing the corresponding dictionary type, use the GetAsStandardDictionaryDescription, GetAsUserDictionaryDescription, GetAsRegExpDictionaryDescription, or GetAsExternalDictionaryDescription method of the DictionaryDescription object.
For each dictionary, the identification property of the dictionary must be specified:
- For a standard dictionary (StandardDictionaryDescription), specify its LanguageId property, which defines the ID of the language.
- For a user dictionary (UserDictionaryDescription), specify its FileName property, which provides the path to the user dictionary.
- For a regular-expression-based dictionary (RegExpDictionaryDescription), use the SetText method to specify the regular expression. See Working with ABBYY FineReader Engine Regular Expressions.
- For an external dictionary (ExternalDictionaryDescription), use the SetDictionary method to specify the dictionary.
All dictionary types are assigned a weight. The weight of a dictionary affects the weight of words from the given dictionary when they are detected during recognition. The weight parameter is a percentage and must be non-negative. A weight of 0 does not automatically mean that there is no such dictionary. Weights of more than 100 percent are allowed, but the user must be very careful when using such parameters. The weight is specified in the IDictionaryDescription::Weight property and is set to 100 by default.
Standard dictionaries also have a CanUseTrigrams option which allows or forbids the program to use trigrams built on the basis of the selected dictionary. Trigrams are combinations of three letters. Not all of these combinations occur in real words. A word with a non-dictionary trigram is very likely to be unpronounceable. Trigrams are used to cut off unreliable words. We recommend enabling trigrams for "general" standard dictionaries and disabling them for dictionaries of terms.
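The idea behind trigram filtering can be sketched in a few lines of standalone code. This is a toy model written for illustration only: the function names are hypothetical, and the way FineReader Engine actually builds and applies trigrams is not described here.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Collects every three-letter substring of the dictionary words.
std::set<std::string> buildTrigrams(const std::vector<std::string>& dictionary) {
    std::set<std::string> trigrams;
    for (const std::string& word : dictionary) {
        for (std::size_t i = 0; i + 3 <= word.size(); ++i) {
            trigrams.insert(word.substr(i, 3));
        }
    }
    return trigrams;
}

// Returns true if every trigram of the candidate occurs in the set.
// Candidates shorter than three letters contain no trigrams and pass.
bool hasOnlyKnownTrigrams(const std::string& candidate,
                          const std::set<std::string>& trigrams) {
    for (std::size_t i = 0; i + 3 <= candidate.size(); ++i) {
        if (trigrams.count(candidate.substr(i, 3)) == 0) {
            return false;  // contains a non-dictionary trigram
        }
    }
    return true;
}
```

With trigrams built from "reading" and "engine", the candidate "dine" passes even though it is not in the dictionary, while "ginger" is rejected because "nge" never occurs. This also suggests why the check suits general dictionaries better than term dictionaries: a narrow word list yields a narrow trigram set that would reject many legitimate words.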
Dictionaries of a recognition language
A text recognition language (the TextLanguage object) can have both dictionaries containing words of the language and dictionaries with prohibited words. The former are specified for each base recognition language of the text language and are accessible via the IBaseLanguage::DictionaryDescriptions property. A base language may have no dictionary attached to it. The prohibiting dictionaries are attached directly to the text recognition language through the ITextLanguage::ProhibitingDictionaries property.
If you want only the dictionary words to be allowed during recognition, set the IBaseLanguage::AllowWordsFromDictionaryOnly property to TRUE. In this case, a word that is not found in the dictionary of the base language can appear in the recognized text only if ABBYY FineReader Engine found no dictionary variants.
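The fallback rule for AllowWordsFromDictionaryOnly can be modeled with a short standalone sketch. It assumes that recognition produces a list of candidate words ordered from best to worst, which is a simplification; the function pickWord is hypothetical and does not correspond to any Engine call.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Picks one word from recognition candidates that are ordered best first.
// Toy model of the rule only, not the Engine's actual selection logic.
std::string pickWord(const std::vector<std::string>& rankedCandidates,
                     const std::set<std::string>& dictionary,
                     bool allowWordsFromDictionaryOnly) {
    assert(!rankedCandidates.empty());  // at least one candidate is required
    if (allowWordsFromDictionaryOnly) {
        for (const std::string& candidate : rankedCandidates) {
            if (dictionary.count(candidate) != 0) {
                return candidate;  // best-ranked dictionary variant wins
            }
        }
    }
    // Either the restriction is off or no dictionary variant exists,
    // so the best-ranked candidate is returned as is.
    return rankedCandidates.front();
}
```

In this model, the candidates "c1ty", "city", "clty" with a dictionary containing "city" yield "city" when the restriction is on and "c1ty" when it is off. A single candidate that is not in the dictionary is still returned, because there is no dictionary variant to prefer.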
How to attach a dictionary to a recognition language
- Create a TextLanguage object using one of the available methods (e.g., the CreateTextLanguage method of the LanguageDatabase object).
- Obtain the collection of base languages of the new text language (use the BaseLanguages property).
- Create a new BaseLanguage object and add it to a collection of base languages.
- Obtain the collection of dictionary descriptions of the new base language (the DictionaryDescriptions property).
- Create a dictionary description and add it to the collection of dictionary descriptions of the base language. Use the AddNew method of the DictionaryDescriptions collection.
Note: You can create several dictionaries of different types and add them to the DictionaryDescriptions collection of one base recognition language.
- [Optional] Specify the weight of the created dictionary.
- Specify the identification property of the dictionary: the LanguageId property for a standard dictionary, the FileName property for a user dictionary, call the IRegExpDictionaryDescription::SetText method for a regular-expression-based dictionary or call the IExternalDictionaryDescription::SetDictionary method for an external dictionary.
- [Optional] Specify other properties of the BaseLanguage object.
- [Optional] Set the prohibiting dictionaries using the ProhibitingDictionaries property of the TextLanguage object.
- Assign the created TextLanguage object to the TextLanguage property of the RecognizerParams object.
To implement these steps, see the code samples below.
C++ (COM) code
// Global ABBYY FineReader Engine object
FREngine::IEnginePtr Engine;
// A LanguageDatabase object
FREngine::ILanguageDatabasePtr languageDatabase;
...
// Create a TextLanguage object and receive its collection of base languages
FREngine::ITextLanguagePtr pTextLang = languageDatabase->CreateTextLanguage();
FREngine::IBaseLanguagesPtr pBaseLangCollection = pTextLang->BaseLanguages;
// Create a BaseLanguage object and receive its collection of dictionary descriptions
FREngine::IBaseLanguagePtr pBaseLang = pBaseLangCollection->AddNew();
pBaseLang->InternalName = L"SampleBaseLanguage";
pBaseLang->PutLetterSet( FREngine::BLLS_Alphabet, L"abc123" );
FREngine::IDictionaryDescriptionsPtr pDictDescCollection = pBaseLang->DictionaryDescriptions;
// Create a standard dictionary description and add it to the collection
FREngine::IDictionaryDescriptionPtr pDicDescription = pDictDescCollection->AddNew( FREngine::DT_SystemDictionary );
// [optional] Specify the weight of the created dictionary
pDicDescription->Weight = 100;
// Specify the identification property of the dictionary
FREngine::IStandardDictionaryDescriptionPtr pStandardDic = pDicDescription->GetAsStandardDictionaryDescription();
pStandardDic->LanguageId = FREngine::LI_EnglishUnitedStates;
// [optional] Specify other properties of the BaseLanguage object
pBaseLang->AllowWordsFromDictionaryOnly = VARIANT_TRUE;
// Create a RecognizerParams object
FREngine::IRecognizerParamsPtr pParams = Engine->CreateRecognizerParams();
// Assign the created TextLanguage object to the TextLanguage property
pParams->TextLanguage = pTextLang;
...
C# code
// Global ABBYY FineReader Engine object
FREngine.IEngine engine;
// A LanguageDatabase object
FREngine.ILanguageDatabase languageDatabase;
...
// Create a TextLanguage object and receive its collection of base languages
FREngine.ITextLanguage TextLang = languageDatabase.CreateTextLanguage();
FREngine.IBaseLanguages BaseLangCollection = TextLang.BaseLanguages;
// Create a BaseLanguage object and receive its collection of dictionary descriptions
FREngine.IBaseLanguage BaseLang = BaseLangCollection.AddNew();
BaseLang.InternalName = "SampleBaseLanguage";
BaseLang.set_LetterSet( FREngine.BaseLanguageLetterSetEnum.BLLS_Alphabet, "abc123" );
FREngine.IDictionaryDescriptions DictDescCollection = BaseLang.DictionaryDescriptions;
// Create a standard dictionary description and add it to the collection
FREngine.IDictionaryDescription DicDescription = DictDescCollection.AddNew( FREngine.DictionaryTypeEnum.DT_SystemDictionary );
// [optional] Specify the weight of the created dictionary
DicDescription.Weight = 100;
// Specify the identification property of the dictionary
FREngine.IStandardDictionaryDescription StandardDic = DicDescription.GetAsStandardDictionaryDescription();
StandardDic.LanguageId = FREngine.LanguageIdEnum.LI_EnglishUnitedStates;
// [optional] Specify other properties of the BaseLanguage object
BaseLang.AllowWordsFromDictionaryOnly = true;
// Create a RecognizerParams object
FREngine.IRecognizerParams recParams = engine.CreateRecognizerParams();
// Assign the created TextLanguage object to the TextLanguage property
recParams.TextLanguage = TextLang;
...
Cache dictionary
A cache dictionary is a small dictionary (about a hundred words) that can be changed easily during processing. Cache dictionaries can be used when it is possible to select a dictionary more precisely, e.g., if you find new information about the document during processing. Such dictionaries are suitable for field-level recognition.
For example, suppose there are two fields on a form you need to recognize: the name of a city and the name of a street. You have recognized the name of the city, and you have the list of streets in this city. In this case, you may load the appropriate cache dictionary with the street names and thus recognize the name of the street more quickly and accurately.
ABBYY FineReader Engine provides the AddWordsToCacheDictionary, AddWordToCacheDictionary, and CleanCacheDictionary methods of the FRPage object for working with cache dictionaries.
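The city and street scenario comes down to choosing a word list after the first field has been recognized. The standalone sketch below shows only that selection step; the StreetIndex type and the cacheWordsForCity function are hypothetical, and the exact signatures of the FRPage cache dictionary methods are not shown here and should be taken from their reference descriptions.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical lookup table: city name -> street names in that city.
using StreetIndex = std::map<std::string, std::vector<std::string>>;

// Returns the words to load into the cache dictionary once the city field
// has been recognized. An unknown city yields an empty list, in which case
// the street field would be recognized without a cache dictionary.
std::vector<std::string> cacheWordsForCity(const StreetIndex& index,
                                           const std::string& recognizedCity) {
    const auto it = index.find(recognizedCity);
    if (it == index.end()) {
        return {};
    }
    return it->second;
}
```

In an application, the returned words would be added to the cache dictionary before the street field is recognized, and the cache dictionary would be cleaned with CleanCacheDictionary before the next form is processed.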
Important! To use the cache dictionary, you should set the IEngine::AutoCleanRecognizerSession property to FALSE. The AutoCleanRecognizerSession property is set to TRUE by default, which means that FineReader Engine cleans its recognition session after recognizing each page, in which case the cache dictionary is cleaned too. To prevent accidental destruction of user data, FineReader Engine prohibits the use of cache dictionaries in this mode. If you use the cache dictionary, you are responsible for cleaning the recognition session manually by calling the IEngine::CleanRecognizerSession method when necessary. See the description of the method to find out when it is necessary to clean the recognition session.
3/24/2023 8:51:52 AM