An XML file contains the recognized text, together with additional information about its structure, attributes, and recognition variants, all expressed as XML tags. The table below describes the possible tags. Some tags may be absent, depending on the values of the recognition parameters. For example, word or character recognition variants are saved only if the corresponding properties of the XMLExportParams object are set to TRUE.
The picture below shows an example of Picture, Text, and Table blocks in the output XML file.
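As a rough illustration of the nesting described in the table (document, then page, then block), the following Python sketch walks a small hand-written sample and lists the block types on each page. The sample XML and the helper names `local`, `children`, and `summarize` are assumptions made for this example. Real exported files probably declare an XML namespace on the root element, so the helper compares local tag names only.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that follows the tag table; not real engine output.
SAMPLE = """<document version="1.0" producer="example" pagesCount="1">
  <page width="800" height="600" resolution="300">
    <block blockType="Text" l="10" t="20" r="400" b="80">
      <region><rect l="10" t="20" r="400" b="80"/></region>
    </block>
    <block blockType="Picture" l="10" t="100" r="200" b="300">
      <region><rect l="10" t="100" r="200" b="300"/></region>
    </block>
  </page>
</document>"""


def local(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def children(elem, name):
    """Return the direct children of elem whose local tag name is name."""
    return [child for child in elem if local(child.tag) == name]


def summarize(xml_text):
    """Return one ((width, height), [blockType, ...]) tuple per page."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in children(root, "page"):
        size = (int(page.get("width")), int(page.get("height")))
        types = [block.get("blockType") for block in children(page, "block")]
        pages.append((size, types))
    return pages


print(summarize(SAMPLE))
```

The same pattern (iterate over direct children, filter by local name) extends to the other tags in the table.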
Name |
Type |
Multiplicity |
Parent Tag |
Description |
document |
Complex Type
Type attributes
version — XML version
producer — the producer of the XML file
pagesCount — (optional) the number of pages in the document
mainLanguage — (optional) the main language of the document
languages — (optional) all languages of the document
|
1 |
no |
Document. |
page |
Complex Type, a sequence of block tags
Type attributes
width — the image width in pixels
height — the image height in pixels
resolution — the image resolution in pixels per inch
originalCoords — (optional) if true, all coordinates are relative to the original image before opening; otherwise, they are relative to the opened (deskewed) image
rotation — (optional) the type of rotation applied to the original page image before processing. It can be one of the following values: Normal, RotatedClockwise, RotatedUpsideDown, RotatedCounterclockwise
|
0...unbounded |
document |
Recognized page. |
block |
BlockType
BlockType elements
Which elements are available depends on the type of the block (see the blockType attribute).
region — always available
text — available only if the blockType attribute is "Text"
row — available only if the blockType attribute is "Table"
separatorsBox — available only if the blockType attribute is "SeparatorsBox"
separator — available only if the blockType attribute is "Separator"
BlockType attributes
blockType — the type of the block. It can be one of the following values: Text, Table, Picture, Barcode, Separator, SeparatorsBox, Checkmark, GroupCheckmark
blockName — (optional) the name of the block
isHidden — (optional) specifies if the block is hidden (the default value is false)
l — (optional) the coordinate of the left border of the block
t — (optional) the coordinate of the top border of the block
r — (optional) the coordinate of the right border of the block
b — (optional) the coordinate of the bottom border of the block
|
0...unbounded |
page |
Recognized block. |
region |
Complex Type, a sequence of rect tags
Has no type attributes
|
1 |
block |
Block region, a set of rectangles. |
rect |
Complex Type
Type attributes
l — the coordinate of the left border of the rectangle
t — the coordinate of the top border of the rectangle
r — the coordinate of the right border of the rectangle
b — the coordinate of the bottom border of the rectangle
|
1...unbounded |
region |
Rectangle of a block region. |
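Because a region is a set of rectangles rather than a single box, a consumer often needs the overall bounding box. A minimal Python sketch, assuming the l, t, r, b attributes hold integer coordinates (the sample values and the helper name `region_bounds` are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Hand-written sample region made of two rectangles.
SAMPLE_REGION = """<region>
  <rect l="10" t="20" r="400" b="80"/>
  <rect l="10" t="80" r="200" b="120"/>
</region>"""


def local(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def region_bounds(region):
    """Return (left, top, right, bottom) enclosing every rect of the region."""
    rects = [
        tuple(int(rect.get(key)) for key in ("l", "t", "r", "b"))
        for rect in region
        if local(rect.tag) == "rect"
    ]
    if not rects:
        # The table gives rect a multiplicity of 1...unbounded.
        raise ValueError("a region needs at least one rect")
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


bounds = region_bounds(ET.fromstring(SAMPLE_REGION))
print(bounds)
```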
text |
TextType
TextType attributes
orientation — (optional) the text orientation. It can be one of the following values: Normal, RotatedClockwise, RotatedUpsidedown, RotatedCounterclockwise (the default value is Normal)
backgroundColor — (optional) the background color of the text (the default value is -1, which means that the color is transparent)
mirrored — (optional) specifies if the text is mirrored (the default value is false)
inverted — (optional) specifies if the text is inverted (the default value is false)
|
0...unbounded |
block |
Text of a recognized text block (present as an element of the block tag if the blockType attribute is "Text"). |
0...unbounded |
cell |
Text of a table cell. |
par |
ParagraphType
ParagraphType attributes
dropCapCharsCount — (optional) the number of drop caps in the paragraph (the default value is 0)
dropCap-l — (optional) the left coordinate of the drop cap rectangle
dropCap-t — (optional) the top coordinate of the drop cap rectangle
dropCap-r — (optional) the right coordinate of the drop cap rectangle
dropCap-b — (optional) the bottom coordinate of the drop cap rectangle
align — (optional) the paragraph alignment. It can be one of the following values: Left, Center, Right, Justified (the default value is Left)
leftIndent — (optional) the left paragraph indent (the default value is 0)
rightIndent — (optional) the right paragraph indent (the default value is 0)
startIndent — (optional) the indent of the first line of the paragraph (the default value is 0)
lineSpacing — (optional) the spacing between lines (the default value is 0)
isListItem — (optional) indicates that the paragraph is part of a list (the default value is false)
lstLvl — (optional) the list level
lstNum — (optional) the number of the paragraph in the list
|
0...unbounded |
text |
Paragraph of a recognized text. |
line |
LineType
LineType attributes
baseline — the distance from the base line to the top edge of the page
l — the coordinate of the left border of the surrounding rectangle
t — the coordinate of the top border of the surrounding rectangle
r — the coordinate of the right border of the surrounding rectangle
b — the coordinate of the bottom border of the surrounding rectangle
|
0...unbounded |
par |
Line of a paragraph. |
formatting |
FormattingType
FormattingType group
charParams
or
wordRecVariants
FormattingType attributes
lang — the name of the language
ff — (optional) the name of the font
fs — (optional) the size of the font
bold — (optional) the bold font style (the default value is false)
italic — (optional) the italic font style (the default value is false)
subscript — (optional) the subscript font effect (the default value is false)
superscript — (optional) the superscript font effect (the default value is false)
smallcaps — (optional) the small caps font effect (the default value is false)
underline — (optional) the underline font effect (the default value is false)
strikeout — (optional) the strikeout font effect (the default value is false)
color — (optional) the color of the font (the default value is 0)
scaling — (optional) the scaling of the font (the default value is 1000)
spacing — (optional) the character spacing (the default value is 0)
|
0...unbounded |
line |
Group of characters with uniform formatting. Character attributes (charParams) alternate with word recognition variants (wordRecVariants); the recognition variants of a word are written before the word itself. |
charParams |
CharParamsType
CharParamsType attributes
l — the coordinate of the left border of the character rectangle
t — the coordinate of the top border of the character rectangle
r — the coordinate of the right border of the character rectangle
b — the coordinate of the bottom border of the character rectangle
suspicious — (optional) if set to TRUE, the character was recognized with low certainty (the default value is false)
proofed — (optional) specifies whether spell-checking was performed upon this character (the default value is false)
wordStart — deprecated; (optional) this property set to TRUE marks the leftmost character in a word
wordFirst — (optional) this property set to TRUE marks the first character in a word
wordLeftmost — (optional) this property set to TRUE marks the leftmost character in a word
wordFromDictionary — (optional) specifies whether the word was found in the dictionary
wordNormal — (optional) specifies whether the word was recognized with either a standard or user-defined language, and that it is not a number or an identifier
wordNumeric — (optional) specifies whether the word is a number
wordIdentifier — (optional) specifies whether the word is an identifier (abbreviation, URL, etc.)
wordPenalty — (optional) penalty for discordance of characters in the word
meanStrokeWidth — (optional) the mean width of the stroke in the RLE representation of the word image, expressed in pixels multiplied by 10
charConfidence — (optional) stores the value of character confidence. It is a numerical estimate of the probability that the recognition was correct. However, this number is not guaranteed to be positive, and the only meaningful use of confidence is to compare different recognition variants of the same character
serifProbability — (optional) specifies the probability that the character is written with a Serif font
isTab — (optional) specifies if the character is a tab
tabLeaderCount — (optional) specifies the number of symbols in the tab leader. The number is calculated at the synthesis stage, taking the font and tab width into account. This attribute is used only if isTab=TRUE
|
0...unbounded |
formatting |
Attributes of a single character. |
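The line, formatting, and charParams entries can be combined to rebuild the text of a line and to flag uncertain characters. The Python sketch below rests on two assumptions that the table does not state: that the recognized character is stored as the text content of each charParams element, and that boolean attributes are written as "true" or "1". Only charParams elements that are direct children of formatting are read, so characters nested inside wordRecVariants are not counted twice. The sample line and the helper names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample line; assumes the character is the element text.
SAMPLE_LINE = (
    '<line baseline="50" l="10" t="20" r="70" b="55">'
    '<formatting lang="English">'
    '<charParams l="10" t="20" r="30" b="50" wordFirst="true">O</charParams>'
    '<charParams l="32" t="20" r="50" b="50" suspicious="true">C</charParams>'
    '<charParams l="52" t="20" r="70" b="50">R</charParams>'
    '</formatting>'
    '</line>'
)


def local(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def is_true(value):
    """Interpret an optional boolean attribute; a missing one counts as false."""
    return value is not None and value.strip().lower() in ("true", "1")


def line_text(line):
    """Return (text, suspicious), where suspicious lists (index, character)."""
    chars, suspicious = [], []
    for formatting in line:
        if local(formatting.tag) != "formatting":
            continue
        for char_params in formatting:
            if local(char_params.tag) != "charParams":
                continue  # skips wordRecVariants and anything else
            ch = char_params.text or ""
            if is_true(char_params.get("suspicious")):
                suspicious.append((len(chars), ch))
            chars.append(ch)
    return "".join(chars), suspicious


print(line_text(ET.fromstring(SAMPLE_LINE)))
```

If character recognition variants are exported, the character may no longer be the first text node of charParams, so this sketch would need adjusting for such files.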
charRecVariants |
Complex Type, a sequence of charRecVariant tags
Has no type attributes
|
|
charParams |
Variants of a character recognition. |
charRecVariant |
CharRecognitionVariant
Type attributes
charConfidence — (optional) a numerical estimate of the probability that the recognition was correct
serifProbability — (optional) probability that a character is written with a Serif font
|
0...unbounded |
charRecVariants |
Variant of a character recognition. |
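Since confidence values are only meaningful when comparing variants of the same character, a typical use is to pick the highest-scoring variant. A hedged Python sketch, assuming each charRecVariant stores its candidate character as element text and that charConfidence parses as a number (the sample values and the name `best_variant` are hypothetical):

```python
import xml.etree.ElementTree as ET

# Hand-written sample; confidence values are invented.
SAMPLE_CHAR = (
    '<charParams l="10" t="20" r="30" b="50">'
    '<charRecVariants>'
    '<charRecVariant charConfidence="40">O</charRecVariant>'
    '<charRecVariant charConfidence="55">0</charRecVariant>'
    '<charRecVariant charConfidence="-3">Q</charRecVariant>'
    '</charRecVariants>'
    '</charParams>'
)


def local(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def best_variant(char_params):
    """Return the variant text with the highest charConfidence, or None."""
    scored = []
    for group in char_params:
        if local(group.tag) != "charRecVariants":
            continue
        for variant in group:
            confidence = variant.get("charConfidence")
            # charConfidence is optional; variants without it cannot be ranked.
            if local(variant.tag) == "charRecVariant" and confidence is not None:
                scored.append((float(confidence), variant.text or ""))
    if not scored:
        return None
    return max(scored, key=lambda item: item[0])[1]


print(best_variant(ET.fromstring(SAMPLE_CHAR)))
```

Note that the comparison stays within one charParams element; the scores are never compared across different characters.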
wordRecVariants |
Complex Type, a sequence of wordRecVariant tags
Has no type attributes
|
|
formatting |
Variants of recognition of the next word. |
wordRecVariant |
WordRecognitionVariant type
WordRecognitionVariant elements
WordRecognitionVariant attributes
wordFromDictionary — (optional) specifies whether the word was found in the dictionary
wordNormal — (optional) specifies whether the word was recognized with a standard or user-defined language, and that it is not a number or an identifier
wordNumeric — (optional) specifies whether the word is a number
wordIdentifier — (optional) specifies whether the word is an identifier (abbreviation, URL, etc.)
wordPenalty — (optional) penalty for discordance of characters in the word
meanStrokeWidth — (optional) the mean width of the stroke in the RLE representation of the word image, expressed in pixels multiplied by 10
|
0...unbounded |
wordRecVariants |
Variant of recognition of the next word. |
variantText |
Complex Type, a sequence of charParams tags
Has no type attributes
|
1 |
wordRecVariant |
Word. |
row |
TableRowType
Has no type attributes
|
0...unbounded |
block |
Table row (present if the blockType attribute is "Table"). |
cell |
Complex Type, a sequence of TextType tags
Type attributes
colSpan — (optional) the column span of the cell (the default value is 1)
rowSpan — (optional) the row span of the cell (the default value is 1)
align — (optional) the vertical alignment of the cell content. It can be one of the following values: Top, Center, Bottom (the default value is Top)
picture — (optional) specifies if the cell contains only a picture (the default value is false)
leftBorder — (optional) the table cell left border type. It can be one of the following values: Absent, Unknown, White, Black (the default value is Black)
topBorder — (optional) the table cell top border type. It can be one of the following values: Absent, Unknown, White, Black (the default value is Black)
rightBorder — (optional) the table cell right border type. It can be one of the following values: Absent, Unknown, White, Black (the default value is Black)
bottomBorder — (optional) the table cell bottom border type. It can be one of the following values: Absent, Unknown, White, Black (the default value is Black)
width — the width of the cell
height — the height of the cell
|
0...unbounded |
row |
Table cell (present if the blockType attribute is "Table"). |
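The colSpan and rowSpan defaults matter when reading a row, because a missing attribute means a span of 1. The Python sketch below applies those defaults and sums the column spans of one row. It deliberately does not try to place cells in a grid, since the table does not say how a cell with rowSpan greater than 1 is represented in the following rows. The sample row and helper names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample row with three cells.
SAMPLE_ROW = (
    '<row>'
    '<cell width="100" height="40" colSpan="2"><text/></cell>'
    '<cell width="50" height="40"><text/></cell>'
    '<cell width="50" height="80" rowSpan="2" picture="true"/>'
    '</row>'
)


def local(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def cell_spans(row):
    """Return (colSpan, rowSpan) per cell, using the documented default of 1."""
    return [
        (int(cell.get("colSpan", "1")), int(cell.get("rowSpan", "1")))
        for cell in row
        if local(cell.tag) == "cell"
    ]


def columns_covered(row):
    """Return the number of columns covered by the cells listed in this row."""
    return sum(col_span for col_span, _ in cell_spans(row))


row = ET.fromstring(SAMPLE_ROW)
print(cell_spans(row), columns_covered(row))
```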
separatorsBox |
Complex Type, a sequence of separator tags
Has no type attributes
|
0...1 |
block |
Group of separators (present if the blockType attribute is "SeparatorsBox"). |
separator |
SeparatorBlockType type
SeparatorBlockType elements
SeparatorBlockType attributes
thickness — specifies the precise width of the separator in pixels
type — specifies the type of the separator. It can be one of the following values: Unknown, Black, Dotted
|
0...1 |
block |
Single separator (present if the blockType attribute is "Separator"). |
0...unbounded |
separatorsBox |
Separator in a group of separators. |
barcodeInfo |
BarcodeInfoType type
BarcodeInfoType attributes
type — specifies the type of the barcode. It can be one of the following values:
- CODE39
- INTERLEAVED25
- EAN13
- CODE128
- EAN8
- PDF417
- CODABAR
- UPCE
- INDUSTRIAL25
- IATA25
- MATRIX25
- CODE93
- POSTNET
- UCC128
- PATCH
- AZTEC
- DATAMATRIX
- QRCODE
- UPCA
- MAXICODE
- CODE32
- FULLASCII
- ROYAL
- KIX
- INTELLIGENT
- AUSTRALIA_POST
- Unknown
supplement — (optional) specifies the type of supplementary barcode. It can be one of the following values: void, 2dig, 5dig
|
0...1 |
block |
Information about the barcode (present if the blockType attribute is "Barcode"). |
start |
Point type
Point attributes
x — the horizontal coordinate of the start point of the separator
y — the vertical coordinate of the start point of the separator
|
1 |
separator |
Start point of a separator. |
end |
Point type
Point attributes
x — the horizontal coordinate of the end point of the separator
y — the vertical coordinate of the end point of the separator
|
1 |
separator |
End point of a separator. |
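The start and end points are enough to derive the length of a separator. A small Python sketch, assuming x and y are integer coordinates in the same units as the other coordinates in the file (the sample values and the name `separator_length` are invented for illustration):

```python
import math
import xml.etree.ElementTree as ET

# Hand-written sample separator.
SAMPLE_SEPARATOR = (
    '<separator thickness="2" type="Black">'
    '<start x="100" y="50"/>'
    '<end x="400" y="450"/>'
    '</separator>'
)


def local(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def separator_length(separator):
    """Return the Euclidean distance between the start and end points."""
    points = {
        local(child.tag): (int(child.get("x")), int(child.get("y")))
        for child in separator
        if local(child.tag) in ("start", "end")
    }
    (x0, y0), (x1, y1) = points["start"], points["end"]
    return math.hypot(x1 - x0, y1 - y0)


length = separator_length(ET.fromstring(SAMPLE_SEPARATOR))
print(length)
```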
documentData |
Complex Type
Has no type attributes
|
0...1 |
document |
Parameters of paragraph and font styles of the document. |
paragraphStyles |
Complex Type, a sequence of paragraphStyle tags
Has no type attributes
|
0...1 |
documentData |
Collection of paragraph formatting styles. |
paragraphStyle |
ParagraphStyleType Type
ParagraphStyleType elements
ParagraphStyleType attributes
id — the identifier of the paragraph style
name — the name of the paragraph style
mainFontStyleId — the main font style of the paragraph
role — the paragraph role. It can be one of the following values:
- text
- tableText
- heading
- tableHeading
- pictureCaption
- tableCaption
- contents — table of contents
- footnote
- endnote
- rt — running title
- garb — garbage
- other
- barcode
- headingNumber
roleLevel — (optional) (the default value is -1, which means that the level is not available for this role)
align — paragraph alignment. It can be one of the following values: Left, Center, Right, Justified, CjkJustified, ThaiJustified
before — (optional) space before the paragraph of this style (the default value is 0)
after — (optional) space after the paragraph of this style (the default value is 0)
startIndent — (optional) indent of the first line of the paragraph
leftIndent — (optional) left indent of the whole paragraph
rightIndent — (optional) right indent of the whole paragraph
lineSpacing — (optional) line spacing
lineSpacingRatio — (optional) line spacing (proportional to the letter height)
fixedLineSpacing — (optional) if true, the line spacing in the paragraph does not vary
|
0...unbounded |
paragraphStyles |
Formatting style of a paragraph. |
fontStyle |
FontStyleType Type
FontStyleType attributes
id — the identifier of the font style
baseFont — (optional)
italic — (optional) if true, the font is italic
bold — (optional) if true, the font is bold
underline — (optional) if true, the font is underlined
strikeout — (optional) if true, the font is struck out
smallcaps — (optional) if true, the font is small caps
scaling — (optional) the scaling of the font (the default value is 1000)
spacing — (optional) the character spacing (the default value is 0)
color — (optional) the color of the font (the default value is 0)
backgroundColor — (optional) the background color (the default value is 0)
ff — the name of the font
fs — the size of the font
|
0...unbounded |
paragraphStyle |
The font style. |
sections |
Complex Type, a sequence of section tags
Has no type attributes
|
0...1 |
documentData |
The collection of document sections. |
section |
SectionType Type
Has no type attributes
|
0...unbounded |
sections |
A document section. |
stream |
TextStreamType Type
TextStreamType attributes
role — (optional) the stream role. It can be one of the following values: garb, text, footnote, incut (the default value is text)
vertCjk — (optional) if true, the stream contains vertical CJK text
beginPage — the number of the page on which the stream begins
endPage — (optional) the number of page on which the stream ends
|
0...unbounded |
section |
A sequence of paragraphs and blocks. |
mainText |
Complex Type
Type attributes
rtl — (optional) if true, the text has right-to-left writing direction
columnCount — the number of columns
|
0...1 |
stream |
|
elemId |
Complex Type
Type attributes
id — string ID of the element
|
0...unbounded |
stream |
The ID of a page element. |
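To show how the documentData entries fit together, the following Python sketch groups elemId values by stream role, applying the documented default role of text. The sample fragment, its id values, and the name `elements_by_role` are assumptions made for this example, and the page numbers in it are placeholders.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; ids and page numbers are invented.
SAMPLE_DATA = (
    '<documentData><sections><section>'
    '<stream beginPage="1">'
    '<mainText columnCount="1"/>'
    '<elemId id="a1"/><elemId id="a2"/>'
    '</stream>'
    '<stream role="footnote" beginPage="1" endPage="1">'
    '<elemId id="f1"/>'
    '</stream>'
    '</section></sections></documentData>'
)


def local(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def elements_by_role(document_data):
    """Map each stream role to the element ids listed directly in its streams."""
    result = {}
    for stream in document_data.iter():
        if local(stream.tag) != "stream":
            continue
        role = stream.get("role", "text")  # documented default is text
        ids = [e.get("id") for e in stream if local(e.tag) == "elemId"]
        result.setdefault(role, []).extend(ids)
    return result


print(elements_by_role(ET.fromstring(SAMPLE_DATA)))
```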