Document XML Scheme

When recognizing a page, ABBYY FineReader Server first analyzes its layout and detects blocks of various types on the page. Each block belongs to one of the four types described below and has its own sequence number and region. A region is a set of rectangles on the image positioned one under another, so that the top edge of each lower rectangle coincides with the bottom edge of the rectangle above it and the rectangles do not overlap. Blocks determine how and in what order the image areas should be recognized.
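The region definition above can be illustrated with a small check. This is only a sketch: the Rect class, its field names, and the is_valid_region function are hypothetical and are not part of the FineReader Server API or of the XML scheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rect:
    # Hypothetical field names; pixel coordinates with the origin at the top left.
    left: int
    top: int
    right: int
    bottom: int


def is_valid_region(rects: List[Rect]) -> bool:
    """Return True if the rectangles are stacked top to bottom without overlap.

    Every rectangle must have a positive area, and the top edge of each
    rectangle after the first must coincide with the bottom edge of the
    rectangle before it.
    """
    if not rects:
        return False
    for rect in rects:
        if rect.left >= rect.right or rect.top >= rect.bottom:
            return False
    return all(lower.top == upper.bottom for upper, lower in zip(rects, rects[1:]))


# An L-shaped region built from two rectangles that share the edge y = 100.
l_shaped = [Rect(0, 0, 200, 100), Rect(0, 100, 80, 250)]
print(is_valid_region(l_shaped))
```

Because each rectangle starts exactly where the previous one ends, a region can describe non-rectangular shapes (such as text wrapped around a picture) while still guaranteeing that no pixel belongs to two rectangles of the same region.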

The following block types are supported:

Text - This is used for image areas that contain text and should only contain single-column text. The recognized text is enclosed in text tags in the XML file. Text is represented as a set of paragraphs, each enclosed in par tags. Within a paragraph, each line is marked by line tags. Within a line, formatting tags carry the formatting attributes. The attributes of individual characters are given as attributes of charParams tags.

Table - This is used for image areas that contain a table or text that has the structure of a table. The recognized table is represented in the XML file by a set of rows (row tags). In a row, each cell is marked by cell tags. Cell text is enclosed in text tags.

Picture - This is used for image areas that contain pictures. This type of block may enclose an actual picture or any other object that should be displayed as a picture (e.g. a section of text). A Picture block is represented in the XML file only by its block region (region tags).

Barcode - This is used for image areas that contain barcodes. The recognized barcode is represented in the XML file by the barcode value, provided that the LookForBarcodes property of the RecognitionParams object is set to TRUE. The barcode value is enclosed in text tags.
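As a rough illustration of how the four block types differ in the output, the following Python sketch summarizes each block of a hand-written sample. The tag names and the blockType attribute are taken from this article. The sample itself, the summarize_block function, and the placement of the barcode value directly inside the text tag are assumptions made for brevity and do not reproduce real FineReader Server output. A real export file may also declare an XML namespace, in which case the tag names would need to be qualified.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (an assumption, not real output): one block of each type.
SAMPLE = """
<document>
  <page>
    <block blockType="Text">
      <region><rect/></region>
      <text><par><line/></par><par><line/></par></text>
    </block>
    <block blockType="Table">
      <region><rect/></region>
      <row><cell><text/></cell><cell><text/></cell></row>
    </block>
    <block blockType="Picture">
      <region><rect/></region>
    </block>
    <block blockType="Barcode">
      <region><rect/></region>
      <text>4006381333931</text>
    </block>
  </page>
</document>
"""


def summarize_block(block: ET.Element) -> str:
    """Describe a block using the child tags expected for its type."""
    kind = block.get("blockType")
    if kind == "Text":
        return f"Text: {len(block.findall('text/par'))} paragraph(s)"
    if kind == "Table":
        rows = block.findall("row")
        cells = sum(len(row.findall("cell")) for row in rows)
        return f"Table: {len(rows)} row(s), {cells} cell(s)"
    if kind == "Picture":
        # A Picture block carries only its region.
        return f"Picture: region of {len(block.findall('region/rect'))} rectangle(s)"
    if kind == "Barcode":
        text = block.find("text")
        # itertext() collects the value however deeply it is nested in the text tag.
        value = "".join(text.itertext()).strip() if text is not None else ""
        return f"Barcode: {value or 'no value exported'}"
    return f"Unknown block type: {kind!r}"


root = ET.fromstring(SAMPLE)
for block in root.iter("block"):
    print(summarize_block(block))
```

Dispatching on the blockType attribute before looking at child tags avoids confusing the text tag of a Text block with the text tag of a Barcode block, which share a name but have different meanings.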

The XML scheme of the output XML document is defined in the ExportToXml.xsd file, which can be found in the Help subfolder of the ABBYY FineReader Server installation folder (the default location is C:\Program Files\ABBYY FineReader Server 14.0\Help).

Note. When working with a page on the FineReader Server 14 Verification Station, blocks are shown as image areas enclosed in frames of different colors, as in the picture below.

The picture below shows Picture, Text, and Table blocks in the output XML file.

Description of Tags
| Name | Type | Multiplicity | Parent Tag | Description |
| --- | --- | --- | --- | --- |
| document | Complex Type, a sequence of page tags; Type attributes | 1 | no | Document |
| page | Complex Type, a sequence of BlockType tags; Type attributes | 0...unbounded | document | Recognized page |
| block | BlockType; BlockType attributes | 0...unbounded | page | Recognized block |
| region | Complex Type, a sequence of rect tags; has no type attributes | 1 | block | Block region, a set of rectangles |
| rect | Complex Type; Type attributes | 1...unbounded | region | Rectangle |
| text | TextType; TextType attributes | 0...1 | block | Recognized block text (present if blockType attribute is Text) |
| par | ParagraphType; ParagraphType attributes | 0...unbounded | text | Text paragraph |
| line | LineType; LineType attributes | 0...unbounded | par | Text paragraph line |
| formatting | FormattingType; FormattingType attributes | 0...unbounded | line | Group of characters with uniform formatting |
| charParams | CharParamsType; CharParamsType attributes | 0...unbounded | formatting | Attributes of a single character |
| row | TableRawType; has no type attributes | 0...unbounded | block | The set of table rows (present if blockType attribute is Table) |
| cell | Complex Type, a sequence of TextType tags; Type attributes | 0...unbounded | row | Table cell |
| text | TextType | 0...unbounded | cell | Cell text |
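The nesting given in the Parent Tag column can be followed in code to rebuild plain text from the export. The sketch below is a minimal illustration in Python under stated assumptions: the sample is hand-written, the helper names (line_text, text_lines, table_grid) are hypothetical, and each recognized character is assumed to be the text content of its charParams tag. If a real export file declares an XML namespace, the paths would need to be qualified accordingly.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hand-written sample (an assumption, not real output) that follows the
# parent/child relations listed in the tag table.
SAMPLE = """
<document>
  <page>
    <block blockType="Text">
      <region><rect/></region>
      <text>
        <par>
          <line><formatting><charParams>H</charParams><charParams>i</charParams></formatting></line>
          <line><formatting><charParams>a</charParams></formatting><formatting><charParams>b</charParams></formatting></line>
        </par>
      </text>
    </block>
    <block blockType="Table">
      <region><rect/></region>
      <row>
        <cell><text><par><line><formatting><charParams>1</charParams></formatting></line></par></text></cell>
        <cell><text><par><line><formatting><charParams>2</charParams></formatting></line></par></text></cell>
      </row>
    </block>
  </page>
</document>
"""


def line_text(line: ET.Element) -> str:
    """Join the characters of one line: line > formatting > charParams."""
    return "".join(ch.text or "" for ch in line.iterfind("formatting/charParams"))


def text_lines(text_el: ET.Element) -> List[str]:
    """Return the lines of a text tag in document order: text > par > line."""
    return [line_text(line) for line in text_el.iterfind("par/line")]


def table_grid(block: ET.Element) -> List[List[str]]:
    """Return the cell texts of a Table block as a list of rows: block > row > cell > text."""
    grid = []
    for row in block.findall("row"):
        cells = []
        for cell in row.findall("cell"):
            lines = [ln for t in cell.findall("text") for ln in text_lines(t)]
            cells.append("\n".join(lines))
        grid.append(cells)
    return grid


root = ET.fromstring(SAMPLE)
text_block, table_block = root.findall("page/block")
print(text_lines(text_block.find("text")))
print(table_grid(table_block))
```

The same text_lines helper serves both Text blocks and table cells, since cell text uses the same TextType as block text.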
See also

COM-based API: XMLExportSettings

Web Services API: XMLExportSettings

26.03.2024 13:49:49
