Dialog Box: HTML

This dialog box allows you to specify HTML format settings.

Set parameters for saving the recognized text into an HTML file:

Option name Option description
Retain layout group

Retain layout

(drop-down list)

Sets layout retention. The following choices are available:

  • Retain full page layout
    Select this option if you want the recognition results to look exactly like the original document.
  • Retain font and font size
    This option retains paragraphs, font types, and font sizes. Text formatting inside paragraphs is not retained.
  • Remove all formatting
    Select this option if you need the content of the original document, but do not need to retain the layout of the document.
Keep pictures

Select this option to keep pictures in the recognized text.

The option is set by default.

Note. The format in which pictures are saved in the output file is selected automatically based on two picture properties: Color Type (black and white, grayscale or color) and Color Variety (low or high). Black-and-white pictures are always saved in PNG format. Grayscale and color pictures are saved in PNG format in the case of low color variety, and in JPEG format in the case of high color variety.
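
The selection rule in the note above can be sketched in code. This is only an illustration of the stated rule, not the product's implementation: the function name `choose_picture_format` and the string values used for color type and color variety are hypothetical.

```python
# Hypothetical labels for the two picture properties described in the note.
COLOR_TYPES = {"black_and_white", "grayscale", "color"}
COLOR_VARIETIES = {"low", "high"}


def choose_picture_format(color_type: str, color_variety: str) -> str:
    """Return "PNG" or "JPEG" according to the rule stated in the note."""
    if color_type not in COLOR_TYPES:
        raise ValueError(f"unknown color type: {color_type!r}")
    if color_variety not in COLOR_VARIETIES:
        raise ValueError(f"unknown color variety: {color_variety!r}")

    # Black-and-white pictures are always saved as PNG.
    if color_type == "black_and_white":
        return "PNG"

    # Grayscale and color pictures: PNG for low variety, JPEG for high variety.
    return "PNG" if color_variety == "low" else "JPEG"


if __name__ == "__main__":
    for color_type in ("black_and_white", "grayscale", "color"):
        for variety in ("low", "high"):
            print(color_type, variety, "->", choose_picture_format(color_type, variety))
```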

Format group
Use CSS

If you select this option, the built-in style sheet (CSS) will be used. CSS requires Internet Explorer 4.0 or later.

Keep headers and footers

If the source document contains a header and/or a footer, selecting this option adds the header to the top of the page and the footer to the bottom of the page. If this option is not selected, the header and footer are not added. Clearing this option is useful for documents with page numbering that you want to omit from your single-page document.

Note. This option is enabled by default.

Character encoding group

Encoding type

(drop-down list)

Specifies the encoding type of the output file:

  • Simple
    Simple encoding, one byte per symbol.
  • Unicode UTF-16
    Native Unicode format where every symbol is represented by a two-byte sequence.
  • Unicode UTF-8
    Unicode UTF-8 format. UTF-8 is a variable-length encoding that represents each Unicode character as a sequence of bytes: ASCII text (U+0000-U+007F) remains unchanged as a single byte, U+0080-U+07FF (including Latin, Greek, Cyrillic, Hebrew, and Arabic) is converted to a 2-byte sequence, and U+0800-U+FFFF (Chinese, Japanese, Korean, and others) becomes a 3-byte sequence.
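
The byte lengths listed for the Unicode options can be checked with a short script. This is a minimal sketch using Python's standard codecs rather than anything from the product itself; the helper names `utf8_byte_count` and `utf16_byte_count` and the sample characters are chosen here purely for illustration.

```python
def utf8_byte_count(ch: str) -> int:
    """Number of bytes used to store one character in UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_byte_count(ch: str) -> int:
    """Number of bytes used to store one character in UTF-16 (no byte order mark)."""
    return len(ch.encode("utf-16-le"))


# One sample character from each range mentioned in the description.
samples = [
    "A",   # U+0041, ASCII range
    "é",   # U+00E9, Latin, inside U+0080-U+07FF
    "Ж",   # U+0416, Cyrillic, inside U+0080-U+07FF
    "中",  # U+4E2D, Chinese, inside U+0800-U+FFFF
]

if __name__ == "__main__":
    for ch in samples:
        print(
            f"U+{ord(ch):04X}: "
            f"{utf8_byte_count(ch)} byte(s) in UTF-8, "
            f"{utf16_byte_count(ch)} byte(s) in UTF-16"
        )
```

For these samples, UTF-8 uses 1, 2, 2, and 3 bytes respectively, while UTF-16 uses 2 bytes for each.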

Code page

(drop-down list)

By default, the code page is detected automatically. To use automatic detection, select the (Automatic) value. If necessary, you can select a code page manually by choosing the value you need from the list.

Remove existing document metadata

Removes the original metadata from the document, including title, author, tags, etc.

See also

Output Format Settings Dialog Box

29.08.2023 11:55:30
