English (English)

Chinese Simplified (简体中文)

Data Extraction

This scenario is used to extract all possible data from a document and store it in a structured way.

The result is a JSON file which represents the document structure. It stores all document objects: printed and handwritten text, tables, barcodes, checkmarks, and images with their location and attributes. This format is optimal for further processing, storing data in a database, or integrating with another application.

A document goes through several processing steps in this scenario:

Preprocessing of scanned images or photos

Images you get by means of a scanner or a digital camera may need some tweaking before they can be optically recognized. For example, noisy images or images with distorted text lines will need some correction for optical recognition to be successful.

Extracting all the data on the document in a structured way

During layout analysis, various objects are detected on the image and put into blocks of corresponding type. The blocks are recognized according to the optimal settings of their type. In the course of synthesis, the logical structure of the document is restored in a consistent manner. The text order even for complex layouts is preserved to be similar to how a human would read it. This ensures that re-recognition of the same document would result in the same order of the text.

Export to a structured format

The recognized document is saved to JSON or XML.

Scenario implementation

Below you will find a detailed description of the recommended method of using ABBYY FineReader Engine 12 to extract the data from documents. The proposed method uses the processing settings that are most suitable for this purpose.

Step 1. Loading ABBYY FineReader Engine

To start your work with ABBYY FineReader Engine, you need to create the Engine object. The Engine object is the top object in the hierarchy of the ABBYY FineReader Engine objects and provides various global settings, some processing methods, and methods for creating the other objects.

To create the Engine object, you can use the InitializeEngine function. See also other ways to load Engine object.

C#

public class EngineLoader : IDisposable
{
    public EngineLoader()
    {
        // Initialize these variables with the full path to FREngine.dll, your Customer Project ID,
        // and, if applicable, the path to your Online License token file and the Online License password
        string enginePath = "";
        string customerProjectId = "";
        string licensePath = "";
        string licensePassword = "";
        // Load the FREngine.dll library
        dllHandle = LoadLibraryEx(enginePath, IntPtr.Zero, LOAD_WITH_ALTERED_SEARCH_PATH);
           
        try
        {
            if (dllHandle == IntPtr.Zero)
            {
                throw new Exception("Can't load " + enginePath);
            }
            IntPtr initializeEnginePtr = GetProcAddress(dllHandle, "InitializeEngine");
            if (initializeEnginePtr == IntPtr.Zero)
            {
                throw new Exception("Can't find InitializeEngine function");
            }
            IntPtr deinitializeEnginePtr = GetProcAddress(dllHandle, "DeinitializeEngine");
            if (deinitializeEnginePtr == IntPtr.Zero)
            {
                throw new Exception("Can't find DeinitializeEngine function");
            }
            IntPtr dllCanUnloadNowPtr = GetProcAddress(dllHandle, "DllCanUnloadNow");
            if (dllCanUnloadNowPtr == IntPtr.Zero)
            {
                throw new Exception("Can't find DllCanUnloadNow function");
            }
            // Convert pointers to delegates
            initializeEngine = (InitializeEngine)Marshal.GetDelegateForFunctionPointer(
                initializeEnginePtr, typeof(InitializeEngine));
            deinitializeEngine = (DeinitializeEngine)Marshal.GetDelegateForFunctionPointer(
                deinitializeEnginePtr, typeof(DeinitializeEngine));
            dllCanUnloadNow = (DllCanUnloadNow)Marshal.GetDelegateForFunctionPointer(
                dllCanUnloadNowPtr, typeof(DllCanUnloadNow));
            // Call the InitializeEngine function 
            // passing the path to the Online License file and the Online License password
            int hresult = initializeEngine(customerProjectId, licensePath, licensePassword, 
                "", "", false, ref engine);
            Marshal.ThrowExceptionForHR(hresult);
        }
        catch (Exception)
        {
            // Free the FREngine.dll library
            engine = null;
            // Deleting all objects before FreeLibrary call
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();
            FreeLibrary(dllHandle);
            dllHandle = IntPtr.Zero;
            initializeEngine = null;
            deinitializeEngine = null;
            dllCanUnloadNow = null;
            throw;
        }
    }
    // Kernel32.dll functions
    [DllImport("kernel32.dll")]
    private static extern IntPtr LoadLibraryEx(string dllToLoad, IntPtr reserved, uint flags);
    private const uint LOAD_WITH_ALTERED_SEARCH_PATH = 0x00000008;
    [DllImport("kernel32.dll")]
    private static extern IntPtr GetProcAddress(IntPtr hModule, string procedureName);
    [DllImport("kernel32.dll")]
    private static extern bool FreeLibrary(IntPtr hModule);
    // FREngine.dll functions
    [UnmanagedFunctionPointer(CallingConvention.StdCall, CharSet = CharSet.Unicode)]
    private delegate int InitializeEngine(string customerProjectId, string licensePath, 
        string licensePassword, string tempFolder, string dataFolder, bool isSharedCPUCoresMode, 
        ref FREngine.IEngine engine);
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DeinitializeEngine();
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DllCanUnloadNow();
    // private variables
    private FREngine.IEngine engine = null;
    // Handle to FREngine.dll
    private IntPtr dllHandle = IntPtr.Zero;
    private InitializeEngine initializeEngine = null;
    private DeinitializeEngine deinitializeEngine = null;
    private DllCanUnloadNow dllCanUnloadNow = null;
}

C++ (COM)

// Initialize these variables with the path to FREngine.dll, your FineReader Engine customer project ID,
// and, if applicable, the path to the Online License token and Online License password
wchar_t* FreDllPath;
wchar_t* CustomerProjectId;
wchar_t* LicensePath;  // if you don't use an Online License, assign empty strings to these variables
wchar_t* LicensePassword;
// HANDLE to FREngine.dll
static HMODULE libraryHandle = 0;
// Global FineReader Engine object
FREngine::IEnginePtr Engine;
void LoadFREngine()
{
    if( Engine != 0 ) {
    // Already loaded
    return;
    }
    // First step: load FREngine.dll
    if( libraryHandle == 0 ) {
        libraryHandle = LoadLibraryEx( FreDllPath, 0, LOAD_WITH_ALTERED_SEARCH_PATH );
        if( libraryHandle == 0 ) {
            throw L"Error while loading ABBYY FineReader Engine";
        }
    }
    // Second step: obtain the Engine object
    typedef HRESULT ( STDAPICALLTYPE* InitializeEngineFunc )( BSTR, BSTR, BSTR, BSTR, 
        BSTR, VARIANT_BOOL, FREngine::IEngine** );
    InitializeEngineFunc pInitializeEngine =
    ( InitializeEngineFunc )GetProcAddress( libraryHandle, "InitializeEngine" );
    if( pInitializeEngine == 0 || pInitializeEngine( CustomerProjectId, LicensePath, 
        LicensePassword, L"", L"", VARIANT_FALSE, &Engine ) != S_OK ) {
    UnloadFREngine();
    throw L"Error while loading ABBYY FineReader Engine";
    }
}

Step 2. Loading settings for the scenario

Step 3. Loading and preprocessing the images

Step 4. Document recognition

Step 5. Document export

Step 6. Unloading ABBYY FineReader Engine

After finishing your work with ABBYY FineReader Engine, you need to unload the Engine object. To do this, use the DeinitializeEngine exported function.

C#

public class EngineLoader : IDisposable
{
    // Unload FineReader Engine
    public void Dispose()
    {
        if (engine == null)
        {
            // Engine was not loaded
            return;
        }
        engine = null;
        // Deleting all objects before FreeLibrary call
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        int hresult = deinitializeEngine();

        hresult = dllCanUnloadNow();
        if (hresult == 0)
        {
            FreeLibrary(dllHandle);
        }
        dllHandle = IntPtr.Zero;
        initializeEngine = null;
        deinitializeEngine = null;
        dllCanUnloadNow = null;
        // throwing exception after cleaning up
        Marshal.ThrowExceptionForHR(hresult);
    }
    // Kernel32.dll functions
    [DllImport("kernel32.dll")]
    private static extern IntPtr LoadLibraryEx(string dllToLoad, IntPtr reserved, uint flags);
    private const uint LOAD_WITH_ALTERED_SEARCH_PATH = 0x00000008;
    [DllImport("kernel32.dll")]
    private static extern IntPtr GetProcAddress(IntPtr hModule, string procedureName);
    [DllImport("kernel32.dll")]
    private static extern bool FreeLibrary(IntPtr hModule);
    // FREngine.dll functions
    [UnmanagedFunctionPointer(CallingConvention.StdCall, CharSet = CharSet.Unicode)]
    private delegate int InitializeEngine( string customerProjectId, string LicensePath, string LicensePassword, , , , ref FREngine.IEngine engine);
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DeinitializeEngine();
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int DllCanUnloadNow();
    // private variables
    private FREngine.IEngine engine = null;
    // Handle to FREngine.dll
    private IntPtr dllHandle = IntPtr.Zero;
    private InitializeEngine initializeEngine = null;
    private DeinitializeEngine deinitializeEngine = null;
    private DllCanUnloadNow dllCanUnloadNow = null;
}

C++ (COM)

void UnloadFREngine()
{
 if( libraryHandle == 0 ) {
  return;
 }
 // Release Engine object
 Engine = 0;
 // Deinitialize FineReader Engine
 typedef HRESULT ( STDAPICALLTYPE* DeinitializeEngineFunc )();
 DeinitializeEngineFunc pDeinitializeEngine =
  ( DeinitializeEngineFunc )GetProcAddress( libraryHandle, "DeinitializeEngine" );
 if( pDeinitializeEngine == 0 || pDeinitializeEngine() != S_OK ) {
  throw L"Error while unloading ABBYY FineReader Engine";
 }
 // Now it's safe to free the FREngine.dll library
 FreeLibrary( libraryHandle );
 libraryHandle = 0;
}

Required resources

You can use the FREngineDistribution.csv file to automatically create a list of files required for your application to function. For processing with this scenario, select in the column 5 (RequiredByModule) the following values:

Core

Core.Resources

Opening

Opening, Processing

Processing

Processing.OCR

Processing.OCR, Processing.ICR

Processing.OCR.NaturalLanguages

Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages

Export

Export, Processing

If you modify the standard scenario, change the required modules accordingly. You also need to specify the interface languages, recognition languages and any additional features which your application uses (such as, for example, Opening.PDF if you need to open PDF files, or Processing.OCR.CJK if you need to recognize texts in CJK languages). See Working with the FREngineDistribution.csv File for further details.

Additional optimization for specific tasks

Below is the overview of the Help topics containing additional information regarding customization of settings at different stages of the document conversion to an editable format:

Scanning

Scanning
Description of the ABBYY FineReader Engine scenario for document scanning.

Recognition

Tuning Parameters of Preprocessing, Analysis, Recognition, and Synthesis
Customization of document processing using objects of analysis, recognition and synthesis parameters.
PageProcessingParams Object
This object enables the customization of analysis and recognition parameters. Using this object, you can indicate which image and text characteristics must be detected (inverted image, orientation, barcodes, recognition language, recognition error margin).
SynthesisParamsForPage Object
This object includes parameters responsible for restoration of a page formatting during synthesis.
SynthesisParamsForDocument Object
This object enables customization of the document synthesis: restoration of its structure and formatting.
MultiProcessingParams Object
Simultaneous processing may be useful when processing a large number of images. In this case, the processing load will be spread over the processor cores during image opening and preprocessing, layout analysis, recognition, and export, which makes it possible to speed up processing.
Reading modes (simultaneous or consecutive) are set using the MultiProcessingMode property. The RecognitionProcessesCount property controls the number of processes that may be started.

Export

Tuning Export Parameters
Customization of the document export using export parameters objects.
XMLExportParams Object
This object provides the settings of export to XML format.
JsonExportParams Object
This object provides the settings of export to JSON format.

Your use of this site is conditioned on Your continued compliance with the Terms of Use.

Terms of Use

Disclaimer of Warranty

Limitation of Liability

Transmission and Submission of Information

Downloads

Use of Content

Trademarks

Links to Third-Party Sites

Foreign Legislation

Subscription Terms

Partner Subscription Terms

Data Extraction

Scenario implementation

Step 1. Loading ABBYY FineReader Engine

C#

C++ (COM)

Step 2. Loading settings for the scenario

C#

C++ (COM)

Step 3. Loading and preprocessing the images

C#

C++ (COM)

Step 4. Document recognition

C#

C++ (COM)

Step 5. Document export

C#

C++ (COM)

Step 6. Unloading ABBYY FineReader Engine

C#

C++ (COM)

Required resources

Additional optimization for specific tasks

See also