Working with Languages
The recognition language is one of the main recognition parameters, and it must be set correctly before analysis and recognition. To specify it, call the IRecognizerParams::SetPredefinedTextLanguage method, which updates the IRecognizerParams::TextLanguage property. By default, this property is initialized with the English recognition language. You can also use language autodetection (see the IRecognizerParams::LanguageDetectionMode property for details).
Important! Language autodetection works only with the predefined languages (see the full list in Predefined Languages in ABBYY FineReader Engine).
Below you can find useful information about the languages supported in ABBYY FineReader Engine by default and objects that provide advanced functionality for working with recognition languages.
Predefined languages
ABBYY FineReader Engine supports a set of languages by default. These languages are called "predefined languages." They are available as the PredefinedLanguages object, a collection of PredefinedLanguage objects, which you can access via the PredefinedLanguages property of the Engine object.
The predefined languages are identified by their internal names. You may directly specify a recognition language by the name of the corresponding predefined language via the IRecognizerParams::SetPredefinedTextLanguage method. For the list of the internal names of the predefined languages, see Predefined Languages in ABBYY FineReader Engine.
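The call described above can be sketched in Python as follows. This is only an illustration and does not use the real Engine: FakeRecognizerParams is a hypothetical stand-in that records the requested name, and use_predefined_language is a hypothetical helper. The only name taken from the documentation is the SetPredefinedTextLanguage method; with a real RecognizerParams object obtained from the Engine, the same helper is assumed to work unchanged.

```python
class FakeRecognizerParams:
    """Hypothetical stand-in for RecognizerParams; NOT the real Engine object."""

    def __init__(self):
        # The documentation states that English is the default language.
        self.text_language_name = "English"

    def SetPredefinedTextLanguage(self, name):
        # The real method updates the TextLanguage property; this stand-in
        # only records the requested internal name.
        self.text_language_name = name


def use_predefined_language(params, internal_name):
    """Select a predefined language by its internal name before recognition."""
    cleaned = (internal_name or "").strip()
    if not cleaned:
        raise ValueError("internal language name must not be empty")
    params.SetPredefinedTextLanguage(cleaned)
    return params


params = FakeRecognizerParams()
default_name = params.text_language_name      # value before any call
use_predefined_language(params, "German")
selected_name = params.text_language_name     # value after the call
```

The helper rejects empty names early so that a misconfigured caller fails before analysis starts rather than silently keeping the default language.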
Recognition language for a text
The language which is used during recognition is represented by the TextLanguage object. The RecognizerParams object that specifies the recognition parameters stores a reference to the TextLanguage object. The recognition functions take this object either as a subobject of the PageProcessingParams object passed to them as an input parameter or from a block in a Layout object.
The TextLanguage object exposes the following main properties:
- Internal name. Each language should have a unique internal name. The names of the languages supplied in the ABBYY FineReader Engine distribution pack are already unique; make sure the names of any new languages you create are unique as well.
- Letter sets. The TextLanguage object contains the following letter sets: punctuation marks that may be encountered between words, prohibited characters, and additional punctuation marks that go immediately before and after words.
- Prohibiting dictionaries. You can create a collection of prohibiting dictionaries using the ProhibitingDictionaries property of the TextLanguage object. The words from these dictionaries cannot be used as variants of a recognized word. But if no variants are left and using a prohibited word is the only option, words from these dictionaries may still appear in the recognized text. See Working with Dictionaries.
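The behavior of prohibiting dictionaries described above can be modeled with a short Python function. This is a simplified sketch of the stated rule, not the Engine's actual algorithm: choose_variant is a hypothetical name, the variants are assumed to arrive ordered best first, and words are compared by exact match.

```python
def choose_variant(variants, prohibited_words):
    """Pick a word variant, avoiding prohibited words when possible.

    variants: candidate words, assumed to be ordered best first.
    prohibited_words: words taken from the prohibiting dictionaries.
    """
    if not variants:
        return None
    prohibited = set(prohibited_words)
    allowed = [v for v in variants if v not in prohibited]
    # A prohibited word is used only when no other variant is left.
    return allowed[0] if allowed else variants[0]
```

For example, with variants "cat" and "cot" and a prohibiting dictionary containing "cat", the function returns "cot"; if "cat" is the only variant, it is returned despite being prohibited.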
Recognition language for characters
During recognition, the text is separated into words, with one or several recognition languages corresponding to each word. One recognition language is assigned to each character in a word. This recognition language is represented by the BaseLanguage object and is accessible via the ITextLanguage::BaseLanguages property.
The BaseLanguage object has the following properties:
- Internal name. Each base language should have a unique internal name. The names of the languages supplied with the ABBYY FineReader Engine distribution pack are already unique; make sure the names of any new languages you create are unique as well.
If one base recognition language corresponds to one recognized word, the ICharParams::LanguageName property for each character in this word is set to the internal name of the base language after recognition. If several base recognition languages correspond to one word (e.g., for bilingual compound words), the ICharParams::LanguageName property for the characters in this word is empty. The ICharParams::LanguageId property contains the identifier of the base language no matter what the recognized word is.
- Letter sets. A letter set comprises letters that form the alphabet of the language, letters that form its extended alphabet (used in loan words), punctuation marks that go immediately before and after words, characters that are allowed inside words but are ignored by the internal spelling check system, and symbols allowed in subscript and superscript.
- Dictionary. A recognition language for a word may have a dictionary attached to it. See Working with Dictionaries.
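The rule for ICharParams::LanguageName given under "Internal name" above can be restated as a small Python function. This is a model of the documented rule only: language_name_for_word is a hypothetical name, and the treatment of a word with no base languages (an empty result) is an assumption.

```python
def language_name_for_word(base_language_names):
    """Model the LanguageName value for the characters of one recognized word.

    base_language_names: internal names of the base languages that
    correspond to the word.
    """
    distinct = set(base_language_names)
    if len(distinct) == 1:
        # One base language: every character reports its internal name.
        return next(iter(distinct))
    # Several base languages (e.g. a bilingual compound word): empty name.
    # An empty input is also mapped to "" here, which is an assumption.
    return ""
```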
Creating a compound recognition language
ABBYY FineReader Engine provides an easy way to create compound recognition languages made up of several predefined recognition languages. This is done via the LanguageDatabase object. For example, you may create a recognition language that includes both English and German words:
- Create a LanguageDatabase object by calling the IEngine::CreateLanguageDatabase method.
- Call the ILanguageDatabase::CreateCompoundTextLanguage method with the parameter "English,German".
- Use the received TextLanguage object for text recognition.
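The three steps above can be sketched in Python. The sketch runs without the Engine: FakeEngine and FakeLanguageDatabase are hypothetical stand-ins, and the dictionary they return is only a placeholder for a TextLanguage object. The method names CreateLanguageDatabase and CreateCompoundTextLanguage come from the documentation; create_compound_language is a hypothetical helper that is assumed to work the same way with a real Engine object.

```python
class FakeLanguageDatabase:
    """Hypothetical stand-in for LanguageDatabase; NOT the real object."""

    def CreateCompoundTextLanguage(self, names):
        # Placeholder result standing in for a TextLanguage object.
        return {"kind": "TextLanguage", "base_languages": names.split(",")}


class FakeEngine:
    """Hypothetical stand-in for the Engine object."""

    def CreateLanguageDatabase(self):
        return FakeLanguageDatabase()


def create_compound_language(engine, language_names):
    """Build a compound language from internal names of predefined languages."""
    names = [n.strip() for n in language_names if n.strip()]
    if not names:
        raise ValueError("at least one predefined language name is required")
    database = engine.CreateLanguageDatabase()                   # step 1
    return database.CreateCompoundTextLanguage(",".join(names))  # step 2


# Step 3 would pass this object on to text recognition.
text_language = create_compound_language(FakeEngine(), ["English", "German"])
```

Joining the names in the helper keeps the comma-separated format ("English,German") in one place.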
The LanguageDatabase object also allows you to import custom user-defined languages created in ABBYY FineReader Engine Visual Components. The Visual Components provide the means for creating custom recognition languages with letter sets, dictionaries, and other parameters specified by the user. The recognition languages created in this way are stored in a set of files and may be accessed by using the LanguageDatabase object. If you wish to use the languages created in Visual Components, do the following:
- Create a LanguageDatabase object by calling the IEngine::CreateLanguageDatabase method.
- Load the languages into the LanguageDatabase object using the ILanguageDatabase::LoadFrom method.
- Get the required language by its name as a TextLanguage object from the LanguageDatabase object.
- Use the received TextLanguage object for text recognition.
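The steps above can be sketched in Python under several assumptions. LoadFrom is assumed to take a path to the language files. The text does not name the call that looks a language up by name, so the sketch takes that lookup as a caller-supplied function (find_by_name). FakeEngine, FakeLanguageDatabase, find_in_fake, the path "custom_languages", and the language name "MyCustomLanguage" are all hypothetical and exist only so the example runs without the Engine.

```python
class FakeLanguageDatabase:
    """Hypothetical stand-in for LanguageDatabase; NOT the real object."""

    def __init__(self):
        self.languages = {}

    def LoadFrom(self, path):
        # Pretend the files at `path` define one custom language.
        self.languages = {"MyCustomLanguage": {"name": "MyCustomLanguage",
                                               "source": path}}


class FakeEngine:
    """Hypothetical stand-in for the Engine object."""

    def CreateLanguageDatabase(self):
        return FakeLanguageDatabase()


def find_in_fake(database, name):
    # Lookup for the stand-in only; the real accessor is not named in the text.
    return database.languages.get(name)


def load_custom_language(engine, path, name, find_by_name):
    """Load languages created in Visual Components and return one by name."""
    database = engine.CreateLanguageDatabase()   # step 1
    database.LoadFrom(path)                      # step 2
    language = find_by_name(database, name)      # step 3
    if language is None:
        raise LookupError(f"language {name!r} not found in {path!r}")
    return language                              # step 4: use for recognition


custom = load_custom_language(FakeEngine(), "custom_languages",
                              "MyCustomLanguage", find_in_fake)
```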
9/17/2024 3:14:40 PM