Text Extraction

This scenario extracts the body text of a document together with any text on logos, seals, and other elements outside the body text.

The natural reading order of the text (the order in which a human would read it) is preserved. You can then feed the extracted text to natural language processing (NLP) engines on your side, for example, to summarize it, search it for sensitive information, or run a sentiment review.

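To extract the main text of a document, image files obtained by scanning or saved in an electronic format usually go through several processing stages, each with its own specifics:

Preprocessing of scanned images or photos

Scanned images may require some preprocessing before recognition, for example, if the scanned document contains background noise, skewed text, inverted colors, black margins, or has the wrong orientation or resolution.

Recognizing the maximum amount of text on a document image

Image recognition is performed with settings that ensure all possible text on the document image is found and extracted.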

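Scenario implementation

The recommended method of using ABBYY FineReader Engine 12 in this scenario is described in detail below. The suggested method uses the processing settings that are most suitable for this scenario.

Step 1. Loading ABBYY FineReader Engine

To start working with ABBYY FineReader Engine, you need to create the Engine object. The Engine object is the top-level object in the hierarchy of ABBYY FineReader Engine objects. It provides various global settings, some processing methods, and methods for creating other objects.

To create the Engine object, you can use the InitializeEngine exported function. See also Different Ways to Load the Engine Object.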

C#

using System;
using System.Runtime.InteropServices;

public class EngineLoader : IDisposable
{
    // The Dispose method is shown in Step 7
    public EngineLoader()
    {
        // Initialize these variables with the full path to FREngine.dll, your Customer Project ID,
        // and, if applicable, the path to your Online License token file and the Online License password
        string enginePath = "";
        string customerProjectId = "";
        string licensePath = "";
        string licensePassword = "";
        // Load the FREngine.dll library
        dllHandle = LoadLibraryEx(enginePath, IntPtr.Zero, LOAD_WITH_ALTERED_SEARCH_PATH);

        try
        {
            if (dllHandle == IntPtr.Zero)
            {
                throw new Exception("Can't load " + enginePath);
            }
            IntPtr initializeEnginePtr = GetProcAddress(dllHandle, "InitializeEngine");
            if (initializeEnginePtr == IntPtr.Zero)
            {
                throw new Exception("Can't find the InitializeEngine function");
            }
            IntPtr deinitializeEnginePtr = GetProcAddress(dllHandle, "DeinitializeEngine");
            if (deinitializeEnginePtr == IntPtr.Zero)
            {
                throw new Exception("Can't find the DeinitializeEngine function");
            }
            IntPtr dllCanUnloadNowPtr = GetProcAddress(dllHandle, "DllCanUnloadNow");
            if (dllCanUnloadNowPtr == IntPtr.Zero)
            {
                throw new Exception("Can't find the DllCanUnloadNow function");
            }
            // Convert the function pointers to delegates
            initializeEngine = (InitializeEngine)Marshal.GetDelegateForFunctionPointer(
                initializeEnginePtr, typeof(InitializeEngine));
            deinitializeEngine = (DeinitializeEngine)Marshal.GetDelegateForFunctionPointer(
                deinitializeEnginePtr, typeof(DeinitializeEngine));
            dllCanUnloadNow = (DllCanUnloadNow)Marshal.GetDelegateForFunctionPointer(
                dllCanUnloadNowPtr, typeof(DllCanUnloadNow));
            // Call the InitializeEngine function,
            // passing the path to the Online License token file and the Online License password
            int hresult = initializeEngine(customerProjectId, licensePath, licensePassword,
                "", "", false, ref engine);
            Marshal.ThrowExceptionForHR(hresult);
        }
        catch (Exception)
        {
            // Release the Engine object
            engine = null;
            // Delete all objects before the FreeLibrary call
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();
            // Free the FREngine.dll library if it was loaded
            if (dllHandle != IntPtr.Zero)
            {
                FreeLibrary(dllHandle);
            }
            dllHandle = IntPtr.Zero;
            initializeEngine = null;
            deinitializeEngine = null;
            dllCanUnloadNow = null;
            throw;
        }
    }
    // Kernel32.dll functions
    [DllImport("kernel32.dll")]
    private static extern IntPtr LoadLibraryEx(string dllToLoad, IntPtr reserved, uint flags);
    private const uint LOAD_WITH_ALTERED_SEARCH_PATH = 0x00000008;
    [DllImport("kernel32.dll")]
    private static extern IntPtr GetProcAddress(IntPtr hModule, string procedureName);
    [DllImport("kernel32.dll")]
    private static extern bool FreeLibrary(IntPtr hModule);
    // FREngine.dll functions
    [UnmanagedFunctionPointer(CallingConvention.StdCall, CharSet = CharSet.Unicode)]
    private delegate int InitializeEngine(string customerProjectId, string licensePath,
        string licensePassword, string tempFolder, string dataFolder, bool isSharedCPUCoresMode,
        ref FREngine.IEngine engine);
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DeinitializeEngine();
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DllCanUnloadNow();
    // Private variables
    private FREngine.IEngine engine = null;
    // Handle to FREngine.dll
    private IntPtr dllHandle = IntPtr.Zero;
    private InitializeEngine initializeEngine = null;
    private DeinitializeEngine deinitializeEngine = null;
    private DllCanUnloadNow dllCanUnloadNow = null;
}

C++ (COM)

// Initialize these variables with the path to FREngine.dll, your FineReader Engine Customer Project ID,
// and, if applicable, the path to the Online License token file and the Online License password
wchar_t* FreDllPath;
wchar_t* CustomerProjectId;
wchar_t* LicensePath;  // If you do not use an Online License, assign empty strings to these variables
wchar_t* LicensePassword;
// Handle to FREngine.dll
static HMODULE libraryHandle = 0;
// Global FineReader Engine object
FREngine::IEnginePtr Engine;
// Defined in Step 7
void UnloadFREngine();
void LoadFREngine()
{
    if( Engine != 0 ) {
        // Already loaded
        return;
    }
    // Step 1: Load FREngine.dll
    if( libraryHandle == 0 ) {
        libraryHandle = LoadLibraryEx( FreDllPath, 0, LOAD_WITH_ALTERED_SEARCH_PATH );
        if( libraryHandle == 0 ) {
            throw L"Error while loading ABBYY FineReader Engine";
        }
    }
    // Step 2: Obtain the Engine object
    typedef HRESULT ( STDAPICALLTYPE* InitializeEngineFunc )( BSTR, BSTR, BSTR, BSTR,
        BSTR, VARIANT_BOOL, FREngine::IEngine** );
    InitializeEngineFunc pInitializeEngine =
        ( InitializeEngineFunc )GetProcAddress( libraryHandle, "InitializeEngine" );
    if( pInitializeEngine == 0 || pInitializeEngine( CustomerProjectId, LicensePath,
        LicensePassword, L"", L"", VARIANT_FALSE, &Engine ) != S_OK ) {
        UnloadFREngine();
        throw L"Error while loading ABBYY FineReader Engine";
    }
}

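Step 2. Loading settings for the scenario

The settings most suitable for this scenario can be selected using the LoadPredefinedProfile method of the Engine object. This method takes the profile name as an input parameter. See Working with Profiles for more information.

ABBYY FineReader Engine supports two sets of settings for this scenario: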

Profile name	Description
TextExtraction_Accuracy	The settings are optimized for accuracy: all text on an image is detected, including small text areas of low quality (pictures and tables are not detected). Full synthesis of the logical structure of a document is not performed. Important! This profile is not intended for converting a document to RTF, DOCX, or text-only PDF. Use a document conversion profile for that purpose.
TextExtraction_Speed	The settings are optimized for processing speed: all text on an image is detected, including small text areas of low quality (pictures and tables are not detected). Full synthesis of the logical structure of a document is not performed. The document analysis and recognition processes are sped up. Important! This profile is not intended for converting a document to RTF, DOCX, or text-only PDF. Use a document conversion profile for that purpose.

C#

// Load a predefined profile
engine.LoadPredefinedProfile("TextExtraction_Accuracy");

C++ (COM)

// Load a predefined profile
Engine->LoadPredefinedProfile( L"TextExtraction_Accuracy" );

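To change the processing settings, use the appropriate parameter objects. See Additional optimization for specific tasks below for more information.

Step 3. Loading and preprocessing the images

Step 4. Document recognition

Step 5. Searching for important information

(Optional) Step 6. Document export

Step 7. Unloading ABBYY FineReader Engine

After you have finished working with ABBYY FineReader Engine, you need to unload the Engine object. To do this, use the DeinitializeEngine exported function.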

C#

public class EngineLoader : IDisposable
{
    // Unload FineReader Engine
    public void Dispose()
    {
        if (engine == null)
        {
            // Engine was not loaded
            return;
        }
        engine = null;
        // Delete all objects before the FreeLibrary call
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        int hresult = deinitializeEngine();

        // Free the library only if DllCanUnloadNow reports S_OK (0)
        if (dllCanUnloadNow() == 0)
        {
            FreeLibrary(dllHandle);
        }
        dllHandle = IntPtr.Zero;
        initializeEngine = null;
        deinitializeEngine = null;
        dllCanUnloadNow = null;
        // Throw an exception after cleanup if DeinitializeEngine failed
        Marshal.ThrowExceptionForHR(hresult);
    }
    // Kernel32.dll functions
    [DllImport("kernel32.dll")]
    private static extern IntPtr LoadLibraryEx(string dllToLoad, IntPtr reserved, uint flags);
    private const uint LOAD_WITH_ALTERED_SEARCH_PATH = 0x00000008;
    [DllImport("kernel32.dll")]
    private static extern IntPtr GetProcAddress(IntPtr hModule, string procedureName);
    [DllImport("kernel32.dll")]
    private static extern bool FreeLibrary(IntPtr hModule);
    // FREngine.dll functions
    [UnmanagedFunctionPointer(CallingConvention.StdCall, CharSet = CharSet.Unicode)]
    private delegate int InitializeEngine(string customerProjectId, string licensePath,
        string licensePassword, string tempFolder, string dataFolder, bool isSharedCPUCoresMode,
        ref FREngine.IEngine engine);
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DeinitializeEngine();
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DllCanUnloadNow();
    // Private variables
    private FREngine.IEngine engine = null;
    // Handle to FREngine.dll
    private IntPtr dllHandle = IntPtr.Zero;
    private InitializeEngine initializeEngine = null;
    private DeinitializeEngine deinitializeEngine = null;
    private DllCanUnloadNow dllCanUnloadNow = null;
}

C++ (COM)

void UnloadFREngine()
{
    if( libraryHandle == 0 ) {
        return;
    }
    // Release the Engine object
    Engine = 0;
    // Deinitialize FineReader Engine
    typedef HRESULT ( STDAPICALLTYPE* DeinitializeEngineFunc )();
    DeinitializeEngineFunc pDeinitializeEngine =
        ( DeinitializeEngineFunc )GetProcAddress( libraryHandle, "DeinitializeEngine" );
    if( pDeinitializeEngine == 0 || pDeinitializeEngine() != S_OK ) {
        throw L"Error while unloading ABBYY FineReader Engine";
    }
    // Now it is safe to free the FREngine.dll library
    FreeLibrary( libraryHandle );
    libraryHandle = 0;
}

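Required resources

You can use the FREngineDistribution.csv file to automatically create a list of the files your application needs to function. For processing with this scenario, select the following values in column 5 (RequiredByModule):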

Core

Core.Resources

Opening

Opening, Processing

Processing

Processing.OCR

Processing.OCR, Processing.ICR

Processing.OCR.NaturalLanguages

Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages

Export

Export, Processing

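If you modify the standard scenario, change the required modules accordingly. You also need to specify the interface languages, the recognition languages, and any additional features your application uses (for example, Opening.PDF if you need to open PDF files, or Processing.OCR.CJK if you need to recognize text in CJK languages). See Working with the FREngineDistribution.csv File for further details.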

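Additional optimization for specific tasks

Scanning

Scanning
Description of the ABBYY FineReader Engine scenario for document scanning.

Recognition

Tuning Parameters of Preprocessing, Analysis, Recognition, and Synthesis
Customizing document processing using the analysis, recognition, and synthesis parameter objects.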

Recognize handwriting
The TextExtraction_*** profiles do not include handwritten or handprinted text recognition. If you need to recognize handwriting, set the DetectHandwritten property of the PageAnalysisParams object to TRUE.

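PageProcessingParams Object
This object lets you customize analysis and recognition parameters. Using this object, you can specify which image and text characteristics must be detected (inverted image, orientation, barcodes, recognition language, recognition error margin).
SynthesisParamsForPage Object
This object includes the parameters responsible for restoring page formatting during synthesis.
SynthesisParamsForDocument Object
This object lets you customize document synthesis: the restoration of its structure and formatting.
MultiProcessingParams Object
Simultaneous processing may be useful when processing a large number of images. In this case, the processing load is distributed among the processor cores during image opening and preprocessing, layout analysis, and recognition, which can speed up processing.
The reading mode (simultaneous or consecutive) is set using the MultiProcessingMode property. The RecognitionProcessesCount property controls the number of processes that may be started.

Searching for important information

Working with Layout and Blocks
About page layout, block types, and how to work with them.
Layout Object
This object provides access to the page layout and the recognized text after document recognition.
Working with Text
Working with recognized text, paragraphs, words, and characters.

Rereading the document with special parameters for the specified data types

Field-Level Recognition
Description of the scenario for recognizing short text fragments.

Saving data

To save a recognized document, use the Export or ExportPages method of the FRDocument object, passing a FileExportFormatEnum constant as one of the parameters.
Document Archiving
Description of the scenario for saving documents as electronic copies.

See also

Basic Usage Scenarios Implementation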

07.11.2025 12:48:30
