Working with Text
The text that ABBYY FineReader Engine works with is plain text, i.e., it does not contain frames, tables, and so on. All characters are Unicode. Plain text may contain the following special characters:
- 0x2028 — Line break symbol
- 0x2029 — Paragraph break symbol
- 0xFFFC — Object replacement character (denotes an embedded picture inside the text)
- 0x0009 — Tabulation
- 0x005E — Circumflex accent (^), used by ABBYY FineReader Engine as a replacement for unrecognized characters
- 0x00AC — Not sign (¬), used by ABBYY FineReader Engine as a soft hyphen
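As an illustration of how these code points might be handled once a text string has been retrieved, here is a minimal Python sketch. It works on an ordinary string and does not call the Engine. The function names and the "[picture]" placeholder are hypothetical, and dropping soft hyphens entirely is a simplifying assumption.

```python
# Special characters listed above, as they would appear in a retrieved string.
LINE_BREAK = "\u2028"
PARAGRAPH_BREAK = "\u2029"
OBJECT_REPLACEMENT = "\ufffc"   # embedded picture
UNRECOGNIZED = "\u005e"         # "^"
SOFT_HYPHEN = "\u00ac"          # "¬"


def to_display_text(text: str, picture_placeholder: str = "[picture]") -> str:
    """Convert Engine-style plain text to ordinary newline-delimited text.

    Assumption: soft hyphens are simply removed; tabs are left untouched.
    """
    return (
        text.replace(SOFT_HYPHEN, "")
        .replace(PARAGRAPH_BREAK, "\n\n")
        .replace(LINE_BREAK, "\n")
        .replace(OBJECT_REPLACEMENT, picture_placeholder)
    )


def count_unrecognized(text: str) -> int:
    """Count "^" characters as a rough estimate of unrecognized characters."""
    return text.count(UNRECOGNIZED)


sample = "recog\u00acnition\u2028line two\u2029para ^wo\ufffc"
display = to_display_text(sample)
unrecognized = count_unrecognized(sample)
```

Note that the count is only a heuristic: in a bare string, a circumflex that genuinely occurs in the source cannot be told apart from one substituted for an unrecognized character.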
The attributes and formatting of the text are available via the corresponding objects and properties. You can access the recognized text of a document via its page layout (the IFRPage::Layout property).
Recognized text in the layout
Only text, table, and barcode blocks contain text after recognition. Other blocks have no text. The Text object provides access to the recognized text of text and table blocks, while the BarcodeText object provides access to the text of a barcode block.
To access the recognized text of a block, do the following:
- For text blocks
  - Use the ITextBlock::Text property.
- For table blocks
  - Receive the collection of table cells using the ITableBlock::Cells property.
  - Select the desired cell. Use the methods of the TableCells object.
  - Receive the block object of the cell (the ITableCell::Block property).
  - Check that the block is of type BT_Text (the IBlock::Type property) and receive the TextBlock object using the IBlock::GetAsTextBlock method.
  - Use the ITextBlock::Text property.
- For barcode blocks
  - Receive the barcode text using the IBarcodeBlock::BarcodeText or IBarcodeBlock::Text property. The first one returns the BarcodeText object, which is a collection of characters of the recognized barcode (the BarcodeSymbol objects). The second one returns the text of the barcode as a single string.
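The per-type steps above amount to a dispatch on the block type. The following Python sketch models that logic only. The classes are hypothetical stand-ins written for this example, not the Engine's COM interfaces, and each holds its text as a plain string.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextBlock:
    """Stand-in for a text block; `text` plays the role of its Text."""
    text: str


@dataclass
class BarcodeBlock:
    """Stand-in for a barcode block; `text` is the barcode as one string."""
    text: str


@dataclass
class PictureBlock:
    """Stand-in for any block type that carries no text."""


@dataclass
class TableCell:
    """Stand-in for a table cell; `block` is the block inside the cell."""
    block: object


@dataclass
class TableBlock:
    cells: List[TableCell] = field(default_factory=list)


def recognized_text(block: object) -> List[str]:
    """Return the text fragments a block carries, following the steps above."""
    if isinstance(block, TextBlock):
        return [block.text]
    if isinstance(block, TableBlock):
        # Only cells whose inner block is a text block contribute text.
        return [c.block.text for c in block.cells if isinstance(c.block, TextBlock)]
    if isinstance(block, BarcodeBlock):
        return [block.text]
    return []  # other block types have no text


table = TableBlock([TableCell(TextBlock("A1")), TableCell(PictureBlock()), TableCell(TextBlock("B1"))])
fragments = recognized_text(table)
```

In real code the `isinstance` checks would be replaced by a check of the block type followed by the appropriate conversion, as the list describes.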
Text and paragraphs
The Text object contains a collection of paragraphs. This collection is a Paragraphs object, accessible via the Paragraphs property of the Text object, and it consists of Paragraph objects. The recognized text of each paragraph is available as a Unicode string via the IParagraph::Text property.
The ParagraphParams object holds attributes that apply to the whole paragraph, such as its alignment and indents. This object is accessible via the IParagraph::ExtendedParams property.
The IParagraph::Lines property provides access to a collection of paragraph lines represented by the ParagraphLines object, which, in turn, is a collection of ParagraphLine objects. The latter provides information on the geometrical position of a single paragraph line and so represents the division of the text into lines.
The IParagraph::Words property provides access to a collection of paragraph words represented by the Words object, which is a collection of Word objects. The Word object provides access to a single word of the paragraph.
Character attributes
Each character of the text has its own parameters. They are accessible via the CharParams object. The CharParams object has a large set of character attributes such as its geometrical parameters, its font, and language. The CharParams object contains the character itself in the SelectedCharacterRecognitionVariant property. See also Plain text below for the recommended way of getting information on all characters in a document.
The position of a character in the text is defined by the index of its paragraph and its own index in this paragraph. There also exists a so-called "special position" in the text: the index of the paragraph is the total number of paragraphs, and the index of the character is 0. This is the insertion point at the end of the text. Some methods of the Text object perform operations with the special position, i.e., insert another text fragment or picture in it.
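The addressing scheme can be sketched in a few lines of Python. Paragraphs are modeled here as plain strings, and the helper names are hypothetical. The sketch also assumes that a character index must be strictly less than the paragraph length, which the text above does not state explicitly.

```python
from typing import List, Tuple

Position = Tuple[int, int]  # (paragraph index, character index in that paragraph)


def end_position(paragraphs: List[str]) -> Position:
    """The "special position": paragraph index == paragraph count, character index 0."""
    return (len(paragraphs), 0)


def is_valid_position(paragraphs: List[str], pos: Position) -> bool:
    """True for the special position or for an existing character.

    Assumption: within a paragraph, valid character indices are
    0 .. len(paragraph) - 1.
    """
    if pos == end_position(paragraphs):
        return True
    para, char = pos
    return 0 <= para < len(paragraphs) and 0 <= char < len(paragraphs[para])


paragraphs = ["Hello", "Hi"]
special = end_position(paragraphs)
```

For a text with two paragraphs, the special position is (2, 0). It addresses no existing character, only the insertion point after the last paragraph.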
The SelectedCharacterRecognitionVariant property of the CharParams object provides access to an extended set of attributes specific to a single character, represented by the CharacterRecognitionVariant object. These attributes are set during the recognition and provide some internal recognition information specific to the character. In particular, this object provides more precise information on character recognition certainty, the probability that the character is in a serif font, etc.
Text editing
You can change the attributes of the Text object, but do so with great care if the text is to be exported into an external format. The ABBYY FineReader Engine export methods assume that the text is the result of recognition and that the user has only corrected recognition errors and made no other changes. The objects of the Text group have many interdependent properties, and changing one of them often requires changing others as well. For this reason, changes to the attributes of the recognized text may produce unpredictable export results.
Plain text
You can also access the full recognized text of a document or a page in a special "plain text" format, represented by the PlainText object. It provides only the recognized characters, their recognition confidence, and their positions relative to the source image. This makes it useful when you need these attributes for all the characters in a large document. Use its GetCharacterData method, which returns the data on all characters at once so that you can iterate through it on your side. Iterating through all text blocks takes significantly more time, especially if your application is working via DCOM.
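Once per-character data has been fetched in bulk, processing it is a simple local loop. The Python sketch below assumes the data is available as a text string plus a parallel list of confidence values, one per character. That layout, the `CharInfo` type, and the function name are assumptions made for this example and are not taken from the GetCharacterData signature. The confidence scale is left to the caller.

```python
from typing import List, NamedTuple


class CharInfo(NamedTuple):
    index: int        # position of the character in the text string
    char: str
    confidence: int


def low_confidence_chars(text: str, confidences: List[int], threshold: int) -> List[CharInfo]:
    """Return the characters whose confidence is below `threshold`.

    Assumes `confidences` holds exactly one value per character of `text`.
    """
    if len(text) != len(confidences):
        raise ValueError("text and confidences must have the same length")
    return [
        CharInfo(i, ch, conf)
        for i, (ch, conf) in enumerate(zip(text, confidences))
        if conf < threshold
    ]


suspects = low_confidence_chars("cat", [90, 30, 85], threshold=50)
```

Because all the data is already on the caller's side, a pass like this costs no further calls into the Engine, which is the advantage described above for applications working via DCOM.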
17.09.2024 15:14:40