Recognizing Chinese, Japanese, and Korean Languages
Chinese, Japanese, and Korean languages are often grouped together under the abbreviation "CJK". They have several features in common, such as the use of Chinese characters and of vertical as well as horizontal writing direction.
This section deals with certain peculiarities of recognizing and exporting texts in CJK languages with ABBYY FineReader Engine 12.
First, in order to recognize CJK languages, you must have an ABBYY FineReader Engine license that supports the Chinese, Japanese, and Korean language modules. For more information about licenses and modules, see the Licensing section.
Recognition languages
ABBYY FineReader Engine supports the following predefined recognition languages for CJK texts:
- "ChinesePRC"
- "ChineseTaiwan"
- "Japanese"
- "JapaneseModern"
- "Korean"
- "KoreanHangul"
To select one of these predefined languages, you can use the SetPredefinedTextLanguage method of the RecognizerParams object.
Important! The Japanese (Modern) recognition language is a compound language consisting of the Japanese and English languages and four letters of the Greek alphabet. It is intended for recognizing contemporary Japanese texts (such as reports and research papers), which may include Kanji characters, Kana (Katakana or Hiragana) symbols, and some Latin and/or Greek letters. To get the best recognition results for documents written primarily in Japanese, we strongly recommend using Japanese (Modern) on its own, without combining it with the English language.
ABBYY FineReader Engine also supports recognition language combinations that consist of several of these languages, or of CJK languages together with other languages.
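As an illustration of working with these language names, here is a minimal sketch in plain C++ that does not call FineReader Engine at all. The helpers isCjkLanguage, joinLanguageNames, and isDiscouragedCombination are hypothetical names introduced for this example. The comma-separated form of a language combination is an assumption and should be checked against the SetPredefinedTextLanguage reference before use.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Predefined CJK language names listed in this section.
const std::vector<std::string> kCjkLanguages = {
    "ChinesePRC", "ChineseTaiwan", "Japanese",
    "JapaneseModern", "Korean", "KoreanHangul"};

// Returns true if `name` is one of the names above (case-sensitive match).
bool isCjkLanguage(const std::string& name) {
    return std::find(kCjkLanguages.begin(), kCjkLanguages.end(), name) !=
           kCjkLanguages.end();
}

// Hypothetical helper: joins language names into one comma-separated string.
// ASSUMPTION: the engine accepts a combination written in this form.
std::string joinLanguageNames(const std::vector<std::string>& names) {
    assert(!names.empty() && "at least one language is required");
    std::string result;
    for (const std::string& name : names) {
        if (!result.empty()) {
            result += ',';
        }
        result += name;
    }
    return result;
}

// Hypothetical check for the recommendation above: JapaneseModern already
// includes English, so a list containing both is flagged.
bool isDiscouragedCombination(const std::vector<std::string>& names) {
    const bool hasModern =
        std::find(names.begin(), names.end(), "JapaneseModern") != names.end();
    const bool hasEnglish =
        std::find(names.begin(), names.end(), "English") != names.end();
    return hasModern && hasEnglish;
}
```

For example, an application could call isDiscouragedCombination on a user-selected list and fall back to "JapaneseModern" alone before passing the result of joinLanguageNames to the engine.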
Fonts
To prevent garbling of Asian characters, document synthesis must use a font that includes the necessary set of characters, e.g., Arial Unicode MS or SimSun. You can set the font through the ISynthesisParamsForDocument::FontSet property. By default, the SystemFontSet property of the FontSet object selects those system fonts that correspond to the recognition languages of the document.
Export
You can export texts in CJK languages to PDF/A in "text under the image" mode (IPDFExportParams::TextExportMode = PEM_ImageOnText) to ensure that the exported document looks the same as the original.
The procedure of recognition and export
To process documents written in CJK languages, do the following:
- Create a DocumentProcessingParams object using the CreateDocumentProcessingParams method of the Engine object.
- Specify the recognition language. Use the SetPredefinedTextLanguage method of the RecognizerParams subobject of the PageProcessingParams subobject.
- Select the font set suitable for CJK languages. Use the ISynthesisParamsForDocument::FontSet property of the SynthesisParamsForDocument subobject.
- Pass the configured DocumentProcessingParams object to the Process method of the FRDocument object. If you use methods of the Engine object, you should call one of the synthesis methods of the Engine object with the configured SynthesisParamsForDocument object as a parameter before export.
- Export the recognized text using the Export method of the FRDocument object. If you export to PDF or PDF/A format, specify the required export mode.
Note: Do not use the Word object and its properties, or the IsWordFirst and IsWordLeftmost properties of the CharParams object, for texts written in CJK languages. The processing technology divides text lines into "words" only for internal purposes, and those groups of symbols do not coincide with the actual words.
C++ code
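The steps above can be sketched as follows. This is not the FineReader Engine API: every type below is a hypothetical stand-in that only mirrors the nesting of the parameter objects named in the procedure, so that the sketch compiles without the SDK. The fontNames field, the PdfTextMode enumeration, and both helper functions are assumptions made for illustration. In a real application the objects come from the Engine object, and the Process and Export calls are made on the FRDocument object.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins, NOT the real FineReader Engine interfaces.
enum class PdfTextMode { TextOnly, ImageOnText };  // illustrative names

struct RecognizerParams {
    std::string textLanguage;
    void SetPredefinedTextLanguage(const std::string& name) { textLanguage = name; }
};
struct PageProcessingParams { RecognizerParams recognizerParams; };
struct FontSet { std::vector<std::string> fontNames; };  // assumed field
struct SynthesisParamsForDocument { FontSet fontSet; };
struct DocumentProcessingParams {
    PageProcessingParams pageProcessingParams;
    SynthesisParamsForDocument synthesisParamsForDocument;
};
struct PdfExportParams { PdfTextMode textExportMode = PdfTextMode::TextOnly; };

// Steps 1-3: create the parameters, set the recognition language,
// and choose fonts that contain the required CJK characters.
DocumentProcessingParams makeCjkProcessingParams(
    const std::string& language, const std::vector<std::string>& fonts) {
    assert(!language.empty() && "a recognition language is required");
    DocumentProcessingParams params;
    params.pageProcessingParams.recognizerParams.SetPredefinedTextLanguage(language);
    params.synthesisParamsForDocument.fontSet.fontNames = fonts;
    return params;
}

// Step 5: for PDF or PDF/A export, request "text under the image" mode.
PdfExportParams makeCjkPdfExportParams() {
    PdfExportParams exportParams;
    exportParams.textExportMode = PdfTextMode::ImageOnText;
    return exportParams;
}
```

Step 4 (passing the configured parameters to Process) and the Export call itself are left out because they require a loaded engine and an opened document.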
7/3/2024 8:50:10 AM