English (English)

Chinese Simplified (简体中文)

Text Extraction

This scenario is used to recognize all document text to prepare documents for search and extraction of useful data.

The steps outlined below may serve as a basis for implementing more complex procedures which will extract the necessary data from documents, especially for automated input of paper document data into information systems and databases, and for automated classification and indexing of documents in document management systems (e.g., inputting invoices into accounting software, inputting questionnaires into a CRM system).

This scenario enables the extraction of the body text of a document and texts on logos, seals, and any elements other than the body text.

To extract the main text of the document, image files obtained by scanning or saved in the electronic format typically go through several processing stages, each of which has its own peculiarities:

Preprocessing the scanned images or photos

Scanned images may require some preprocessing prior to recognition, for example, if scanned documents contain background noise, skewed text, inverted colors, black margins, wrong orientation, or resolution.

Recognition of maximum amount of text on a document image

Recognition of images is performed using settings that ensure that all possible text is found and extracted from a document image.

The text obtained as a result of the processing may be used to search for the important data (however, instructions on how to implement this search are outside the scope of this article). Special algorithms can be devised to look up key words, e.g., titles of form fields, tables, lines and table columns, signature and stamp fields, etc. Fields containing important data can be found by key words and then re-read using special recognition parameters depending on the type of data. You can also check them for consistency with the data type and compliance with the necessary restrictions.

The extracted data can be saved to a database and an uneditable copy of the paper document can be put in the digital archive.

Scenario implementation

Below is the detailed description of the recommended method of using ABBYY FineReader Engine 12 in this scenario. The proposed method uses processing settings that are most suitable for this scenario.

Step 1. Loading ABBYY FineReader Engine

To start your work with ABBYY FineReader Engine, you need to create the Engine object. The Engine object is the top object in the hierarchy of the ABBYY FineReader Engine objects and provides various global settings, some processing methods, and methods for creating the other objects.

To create the Engine object, you can use the InitializeEngine function. See also other ways to load Engine object.

C#

public class EngineLoader : IDisposable
{
    public EngineLoader()
    {
        // Initialize these variables with the full path to FREngine.dll, your Customer Project ID,
        // and, if applicable, the path to your Online License token file and the Online License password
        string enginePath = "";
        string customerProjectId = "";
        string licensePath = "";
        string licensePassword = "";
        // Load the FREngine.dll library
        dllHandle = LoadLibraryEx(enginePath, IntPtr.Zero, LOAD_WITH_ALTERED_SEARCH_PATH);
           
        try
        {
            if (dllHandle == IntPtr.Zero)
            {
                throw new Exception("Can't load " + enginePath);
            }
            IntPtr initializeEnginePtr = GetProcAddress(dllHandle, "InitializeEngine");
            if (initializeEnginePtr == IntPtr.Zero)
            {
                throw new Exception("Can't find InitializeEngine function");
            }
            IntPtr deinitializeEnginePtr = GetProcAddress(dllHandle, "DeinitializeEngine");
            if (deinitializeEnginePtr == IntPtr.Zero)
            {
                throw new Exception("Can't find DeinitializeEngine function");
            }
            IntPtr dllCanUnloadNowPtr = GetProcAddress(dllHandle, "DllCanUnloadNow");
            if (dllCanUnloadNowPtr == IntPtr.Zero)
            {
                throw new Exception("Can't find DllCanUnloadNow function");
            }
            // Convert pointers to delegates
            initializeEngine = (InitializeEngine)Marshal.GetDelegateForFunctionPointer(
                initializeEnginePtr, typeof(InitializeEngine));
            deinitializeEngine = (DeinitializeEngine)Marshal.GetDelegateForFunctionPointer(
                deinitializeEnginePtr, typeof(DeinitializeEngine));
            dllCanUnloadNow = (DllCanUnloadNow)Marshal.GetDelegateForFunctionPointer(
                dllCanUnloadNowPtr, typeof(DllCanUnloadNow));
            // Call the InitializeEngine function 
            // passing the path to the Online License file and the Online License password
            int hresult = initializeEngine(customerProjectId, licensePath, licensePassword, 
                "", "", false, ref engine);
            Marshal.ThrowExceptionForHR(hresult);
        }
        catch (Exception)
        {
            // Free the FREngine.dll library
            engine = null;
            // Deleting all objects before FreeLibrary call
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();
            FreeLibrary(dllHandle);
            dllHandle = IntPtr.Zero;
            initializeEngine = null;
            deinitializeEngine = null;
            dllCanUnloadNow = null;
            throw;
        }
    }
    // Kernel32.dll functions
    [DllImport("kernel32.dll")]
    private static extern IntPtr LoadLibraryEx(string dllToLoad, IntPtr reserved, uint flags);
    private const uint LOAD_WITH_ALTERED_SEARCH_PATH = 0x00000008;
    [DllImport("kernel32.dll")]
    private static extern IntPtr GetProcAddress(IntPtr hModule, string procedureName);
    [DllImport("kernel32.dll")]
    private static extern bool FreeLibrary(IntPtr hModule);
    // FREngine.dll functions
    [UnmanagedFunctionPointer(CallingConvention.StdCall, CharSet = CharSet.Unicode)]
    private delegate int InitializeEngine(string customerProjectId, string licensePath, 
        string licensePassword, string tempFolder, string dataFolder, bool isSharedCPUCoresMode, 
        ref FREngine.IEngine engine);
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DeinitializeEngine();
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DllCanUnloadNow();
    // private variables
    private FREngine.IEngine engine = null;
    // Handle to FREngine.dll
    private IntPtr dllHandle = IntPtr.Zero;
    private InitializeEngine initializeEngine = null;
    private DeinitializeEngine deinitializeEngine = null;
    private DllCanUnloadNow dllCanUnloadNow = null;
}

C++ (COM)

// Initialize these variables with the path to FREngine.dll, your FineReader Engine customer project ID,
// and, if applicable, the path to the Online License token and Online License password
wchar_t* FreDllPath;
wchar_t* CustomerProjectId;
wchar_t* LicensePath;  // if you don't use an Online License, assign empty strings to these variables
wchar_t* LicensePassword;
// HANDLE to FREngine.dll
static HMODULE libraryHandle = 0;
// Global FineReader Engine object
FREngine::IEnginePtr Engine;
void LoadFREngine()
{
    if( Engine != 0 ) {
    // Already loaded
    return;
    }
    // First step: load FREngine.dll
    if( libraryHandle == 0 ) {
        libraryHandle = LoadLibraryEx( FreDllPath, 0, LOAD_WITH_ALTERED_SEARCH_PATH );
        if( libraryHandle == 0 ) {
            throw L"Error while loading ABBYY FineReader Engine";
        }
    }
    // Second step: obtain the Engine object
    typedef HRESULT ( STDAPICALLTYPE* InitializeEngineFunc )( BSTR, BSTR, BSTR, BSTR, 
        BSTR, VARIANT_BOOL, FREngine::IEngine** );
    InitializeEngineFunc pInitializeEngine =
    ( InitializeEngineFunc )GetProcAddress( libraryHandle, "InitializeEngine" );
    if( pInitializeEngine == 0 || pInitializeEngine( CustomerProjectId, LicensePath, 
        LicensePassword, L"", L"", VARIANT_FALSE, &Engine ) != S_OK ) {
    UnloadFREngine();
    throw L"Error while loading ABBYY FineReader Engine";
    }
}

Step 2. Loading settings for the scenario

The most suitable settings for this scenario can be selected in ABBYY FineReader Engine using the LoadPredefinedProfile method of the Engine object. This method receives the profile name as an input parameter. Please see Working with Profiles for more information.

ABBYY FineReader Engine supports 2 variants of settings for this scenario:

Profile name	Description
TextExtraction_Accuracy	The settings have been optimized for accuracy: Enables detection of all text on an image, including small text areas of low quality (pictures and tables are not detected). Full synthesis of the logical structure of a document is not performed. Important! The profile is not intended for converting a document into RTF, DOCX, or text-only PDF. Use the document conversion profiles for such purpose.
TextExtraction_Speed	The settings have been optimized for processing speed: Enables detection of all text on an image, including small text areas of low quality (pictures and tables are not detected). Full synthesis of the logical structure of a document is not performed. The processes of document analysis and recognition are speeded up. Important! The profile is not intended for converting a document into RTF, DOCX, or text-only PDF. Use the document conversion profiles for such purpose.

Profile name

Description

TextExtraction_Accuracy

The settings have been optimized for accuracy:

Enables detection of all text on an image, including small text areas of low quality (pictures and tables are not detected).
Full synthesis of the logical structure of a document is not performed.

Important! The profile is not intended for converting a document into RTF, DOCX, or text-only PDF. Use the document conversion profiles for such purpose.

TextExtraction_Speed

The settings have been optimized for processing speed:

Enables detection of all text on an image, including small text areas of low quality (pictures and tables are not detected).
Full synthesis of the logical structure of a document is not performed.
The processes of document analysis and recognition are speeded up.

Important! The profile is not intended for converting a document into RTF, DOCX, or text-only PDF. Use the document conversion profiles for such purpose.

C#

// Load a predefined profile
engine.LoadPredefinedProfile("TextExtraction_Accuracy");

C++ (COM)

// Load a predefined profile
Engine->LoadPredefinedProfile( L"TextExtraction_Accuracy" );

If you wish to change processing settings, use appropriate parameter objects. Please see Additional optimization for specific tasks below for further information.

Step 3. Loading and preprocessing the images

Step 4. Document recognition

Step 5. Searching for important information

Step 6. Unloading ABBYY FineReader Engine

After finishing your work with ABBYY FineReader Engine, you need to unload the Engine object. To do this, use the DeinitializeEngine exported function.

C#

public class EngineLoader : IDisposable
{
    // Unload FineReader Engine
    public void Dispose()
    {
        if (engine == null)
        {
            // Engine was not loaded
            return;
        }
        engine = null;
        // Deleting all objects before FreeLibrary call
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        int hresult = deinitializeEngine();

        hresult = dllCanUnloadNow();
        if (hresult == 0)
        {
            FreeLibrary(dllHandle);
        }
        dllHandle = IntPtr.Zero;
        initializeEngine = null;
        deinitializeEngine = null;
        dllCanUnloadNow = null;
        // throwing exception after cleaning up
        Marshal.ThrowExceptionForHR(hresult);
    }
    // Kernel32.dll functions
    [DllImport("kernel32.dll")]
    private static extern IntPtr LoadLibraryEx(string dllToLoad, IntPtr reserved, uint flags);
    private const uint LOAD_WITH_ALTERED_SEARCH_PATH = 0x00000008;
    [DllImport("kernel32.dll")]
    private static extern IntPtr GetProcAddress(IntPtr hModule, string procedureName);
    [DllImport("kernel32.dll")]
    private static extern bool FreeLibrary(IntPtr hModule);
    // FREngine.dll functions
    [UnmanagedFunctionPointer(CallingConvention.StdCall, CharSet = CharSet.Unicode)]
    private delegate int InitializeEngine( string customerProjectId, string LicensePath, string LicensePassword, , , , ref FREngine.IEngine engine);
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DeinitializeEngine();
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DllCanUnloadNow();
    // private variables
    private FREngine.IEngine engine = null;
    // Handle to FREngine.dll
    private IntPtr dllHandle = IntPtr.Zero;
    private InitializeEngine initializeEngine = null;
    private DeinitializeEngine deinitializeEngine = null;
    private DllCanUnloadNow dllCanUnloadNow = null;
}

C++ (COM)

void UnloadFREngine()
{
 if( libraryHandle == 0 ) {
  return;
 }
 // Release Engine object
 Engine = 0;
 // Deinitialize FineReader Engine
 typedef HRESULT ( STDAPICALLTYPE* DeinitializeEngineFunc )();
 DeinitializeEngineFunc pDeinitializeEngine =
  ( DeinitializeEngineFunc )GetProcAddress( libraryHandle, "DeinitializeEngine" );
 if( pDeinitializeEngine == 0 || pDeinitializeEngine() != S_OK ) {
  throw L"Error while unloading ABBYY FineReader Engine";
 }
 // Now it's safe to free the FREngine.dll library
 FreeLibrary( libraryHandle );
 libraryHandle = 0;
}

Required resources

You can use the FREngineDistribution.csv file to automatically create a list of files required for your application to function. For processing with this scenario, select in the column 5 (RequiredByModule) the following values:

Core

Core.Resources

Opening

Opening, Processing

Processing

Processing.OCR

Processing.OCR, Processing.ICR

Processing.OCR.NaturalLanguages

Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages

If you modify the standard scenario, change the required modules accordingly. You also need to specify the interface languages, recognition languages and any additional features which your application uses (such as, e.g., Opening.PDF if you need to open PDF files, or Processing.OCR.CJK if you need to recognize texts in CJK languages). See Working with the FREngineDistribution.csv File for further details.

Additional optimization for specific tasks

Scanning

Scanning
Description of the ABBYY FineReader Engine scenario for document scanning.

Opening and preprocessing

Image Preprocessing
Description of the ABBYY FineReader Engine scenario for preliminary preparation of images.

Recognition

Tuning Parameters of Preprocessing, Analysis, Recognition, and Synthesis
Customization of document processing using objects of analysis, recognition and synthesis parameters.
PageProcessingParams Object
This object enables customization of analysis and recognition parameters. Using this object, you can indicate which image and text characteristics must be detected (inverted image, orientation, barcodes, recognition language, recognition error margin).
SynthesisParamsForPage Object
This object includes parameters responsible for restoration of a page formatting during synthesis.
SynthesisParamsForDocument Object
This object enables customization of document synthesis: restoration of its structure and formatting.
MultiProcessingParams Object
Simultaneous processing may be useful when processing a large number of images. In this case, the processing load will be spread over the processor cores during image opening and preprocessing, layout analysis, and recognition, which makes it possible to speed up processing.
Reading modes (simultaneous or consecutive) are set using the MultiProcessingMode property. The RecognitionProcessesCount property controls the number of processes that may be started.

Searching for important information

Working with Layout and Blocks
About page layout, block types, and working with them.
Layout Object
This object's parameters provide access to the page layout and the recognized text after document recognition.
Working with Text
Working with recognized text, paragraphs, words, and symbols.

Re-reading of document using special parameters for specified data type

Field-Level Recognition
Description of scenario for recognizing short text segments.

Saving data

To save recognized data, you may use the Export or ExportPages methods of the FRDocument object by assigning the FileExportFormatEnum constant as one of the parameters.
Document Archiving
Description of the scenario for saving an electronic copy of document.

Your use of this site is conditioned on Your continued compliance with the Terms of Use.

Terms of Use

Disclaimer of Warranty

Limitation of Liability

Transmission and Submission of Information

Downloads

Use of Content

Trademarks

Links to Third-Party Sites

Foreign Legislation

Subscription Terms

Partner Subscription Terms

Text Extraction

Scenario implementation

Step 1. Loading ABBYY FineReader Engine

C#

C++ (COM)

Step 2. Loading settings for the scenario

C#

C++ (COM)

Step 3. Loading and preprocessing the images

C#

C++ (COM)

Step 4. Document recognition

C#

C++ (COM)

Step 5. Searching for important information

Step 6. Unloading ABBYY FineReader Engine

C#

C++ (COM)

Required resources

Additional optimization for specific tasks

See also