
Data Extraction


This scenario extracts as much data as possible from a document and stores it in a structured form.

The result is a JSON file which represents the document structure. It stores all document objects: printed and handwritten text, tables, barcodes, checkmarks, and images with their location and attributes. This format is optimal for further processing, storing data in a database, or integrating with another application.

A document goes through several processing steps in this scenario:

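  1. Preprocessing of scanned images or photos

Images obtained from a scanner or a digital camera may need some adjustment before optical recognition. For example, noisy images or images with distorted text lines must be corrected before they can be recognized successfully.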

  2. Extracting all the data on the document in a structured way

During layout analysis, the objects on the image are detected and grouped into blocks of the corresponding type. Each block is recognized with the optimal settings for its type. During synthesis, the logical structure of the document is restored consistently. Even for complex layouts, the text order is kept close to the order in which a human would read it, so recognizing the same document again yields the same text order.

  3. Export to a structured format

The recognized document is saved to JSON or XML.

Scenario implementation

Below is a detailed description of the recommended method of using ABBYY FineReader Engine 12 to extract data from documents. The recommended method uses the processing settings best suited to this purpose.

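Step 1. Loading ABBYY FineReader Engine

Step 2. Loading the settings for the scenario

Step 3. Loading and preprocessing the images

Step 4. Document recognition

Step 5. Document export

Step 6. Unloading ABBYY FineReader Engine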

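Required resources

You can use the FREngineDistribution.csv file to automatically create the list of files your application needs in order to work. For processing with this scenario, select the following values in column 5 (RequiredByModule):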

Core

Core.Resources

Opening

Opening, Processing

Processing

Processing.OCR

Processing.OCR, Processing.ICR

Processing.OCR.NaturalLanguages

Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages

Export

Export, Processing

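If you modify the standard scenario, change the required modules accordingly. You also need to specify the interface languages, the recognition languages, and any other features your application uses (for example, Opening.PDF if you need to open PDF files, or Processing.OCR.CJK if you need to recognize text in CJK languages). See Working with the FREngineDistribution.csv File for further details.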
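
The selection by column 5 can be sketched in Python. This is only an illustration, not an ABBYY tool: it assumes the file is comma-delimited, that a cell containing a comma is quoted, and that RequiredByModule is the fifth column. The function name select_rows, the sample rows, and the file names in them are hypothetical, so check the layout of the real FREngineDistribution.csv before relying on it.

```python
import csv
import io

# Values of column 5 (RequiredByModule) listed for this scenario.
REQUIRED_MODULES = {
    "Core",
    "Core.Resources",
    "Opening",
    "Opening, Processing",
    "Processing",
    "Processing.OCR",
    "Processing.OCR, Processing.ICR",
    "Processing.OCR.NaturalLanguages",
    "Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages",
    "Export",
    "Export, Processing",
}


def select_rows(lines, modules=REQUIRED_MODULES, column=4, delimiter=","):
    """Return the rows whose cell at index `column` (column 5) is in `modules`.

    `lines` is any iterable of text lines, for example an open file.
    Rows that are too short to have a fifth column are skipped.
    """
    selected = []
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) > column and row[column].strip() in modules:
            selected.append(row)
    return selected


# Hypothetical sample: the column names and file names are invented.
SAMPLE = (
    "Col1,Col2,Col3,Col4,RequiredByModule\n"
    "example_core.bin,a,b,c,Core\n"
    'example_open.bin,a,b,c,"Opening, Processing"\n'
    "example_pdf.bin,a,b,c,Opening.PDF\n"
    "example_export.bin,a,b,c,Export\n"
)

if __name__ == "__main__":
    for row in select_rows(io.StringIO(SAMPLE)):
        print(row[0])
```

To cover a modified scenario, pass an extended set, for example `modules=REQUIRED_MODULES | {"Opening.PDF"}`. With a real file, open it with `newline=""` as the csv module recommends and set `delimiter` to whatever the file actually uses.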

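Additional optimization for specific tasks

Below are the relevant sections of the guide where you can find more information on setting the parameters for the different processing steps: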

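  • Scanning
    • Scanning
      Description of the ABBYY FineReader Engine document scanning scenario.
  • Recognition
    • Tuning Parameters of Page Preprocessing, Analysis, Recognition, and Synthesis
      Customizing document processing with the objects that hold the analysis, recognition, and synthesis parameters.
    • PageProcessingParams Object
      This object lets you customize the analysis and recognition parameters. With it you can specify which image and text characteristics must be detected (inverted image, orientation, barcodes, recognition language, recognition error margin).
    • SynthesisParamsForPage Object
      This object contains the parameters responsible for restoring page formatting during synthesis.
    • SynthesisParamsForDocument Object
      This object lets you customize document synthesis: restoring the document's structure and formatting.
    • MultiProcessingParams Object
      Simultaneous processing can be useful when you process a large number of images. In this case, the processing load is distributed across the processor cores during image opening and preprocessing, layout analysis, recognition, and export, which speeds up processing.
      The reading mode (simultaneous or consecutive) is set with the MultiProcessingMode property. The RecognitionProcessesCount property controls the number of processes that can be started.
  • Export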

See also

Basic Usage Scenarios Implementation

07.11.2025 12:48:30
