Capturing data from unstructured documents

A document includes fields to be filled in by hand or by means of a printing device. Documents may have one or more pages.

Documents can be divided into 'fixed' and 'semi-structured' documents. In 'fixed' documents, identical fields have exactly the same locations on all the documents in a batch. Fixed documents can be processed by document processing applications, which read information from the data fields and export it to databases, document management systems, or archiving applications. Data are captured from such documents by means of a Document Definition, which describes the locations of the fields and the type of information they may contain. One and the same Document Definition is used to capture data from all the documents in a given batch. It tells the document processing application where to look for specific data on a document and how to make sure that the data have been captured correctly.
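
The idea of a fixed-location definition can be illustrated with a minimal Python sketch. This is not the ABBYY Document Definition format or API: the names `FieldDef`, `Word` and `capture`, the coordinate units, and the use of regular expressions as the "type of information" check are all assumptions made for illustration. The sketch assumes that recognition has already produced a list of words with page coordinates.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Word:
    """A recognized word and the position of its top-left corner (hypothetical units)."""
    text: str
    x: int
    y: int


@dataclass(frozen=True)
class FieldDef:
    """Hypothetical field description: a fixed rectangle plus the allowed content."""
    name: str
    box: Tuple[int, int, int, int]  # left, top, right, bottom
    pattern: str                    # regular expression the captured text must match


def capture(words: List[Word], definition: List[FieldDef]) -> Dict[str, Optional[str]]:
    """Read each field from its fixed rectangle and validate the captured text."""
    result: Dict[str, Optional[str]] = {}
    for field in definition:
        left, top, right, bottom = field.box
        inside = [w for w in words
                  if left <= w.x <= right and top <= w.y <= bottom]
        inside.sort(key=lambda w: (w.y, w.x))  # reading order: top to bottom, left to right
        text = " ".join(w.text for w in inside)
        # A value that fails the content check is reported as not captured.
        result[field.name] = text if re.fullmatch(field.pattern, text) else None
    return result


# The same definition is applied to every page in the batch.
definition = [
    FieldDef("invoice_number", (400, 40, 600, 70), r"INV-\d+"),
    FieldDef("date", (400, 80, 600, 110), r"\d{4}-\d{2}-\d{2}"),
]
page = [Word("INV-1001", 420, 52), Word("2024-04-12", 420, 92), Word("Total", 50, 500)]
captured = capture(page, definition)
```

Because the rectangles are fixed, this approach only works when every page in the batch has the same layout; a field that moves outside its rectangle is simply not found.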

In the case of 'semi-structured' documents, the locations of identical data fields vary from one document to another. Additionally, not all fields may be present on all of the documents in a batch (e.g. some documents may contain a signature field while others may not). Payment documents of various kinds are a good example of semi-structured documents.

Letters, registration forms and legal documents are other good examples of semi-structured documents. Documents of the same type will have similar structures, but there may still be discrepancies among their fields. For example, letters will contain the name and address of the sender at the top of the page, legal documents will contain the names of the parties and their details, the effective date, etc.

Since the exact location of fields on semi-structured documents is not known in advance, data cannot be captured from such documents by means of a Document Definition. This means that traditional data capture systems cannot extract data from such documents.

ABBYY FlexiLayout Studio allows you to formally describe unstructured documents and provide the program with a search algorithm, enabling it to find data fields and extract information from these fields. A formal description relies on mutual relationships among the fields on an unstructured document and the nature of the data within the fields. Created descriptions can be tested on document images to make sure that information can be reliably extracted.
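
To make the notion of searching by mutual relationships more concrete, here is a minimal Python sketch. It does not use the FlexiLayout language or any ABBYY API: the function `find_right_of`, the `Word` type, the coordinate units and the `line_tolerance` parameter are hypothetical. It assumes recognized words with page coordinates are available, and it locates a value by its position relative to a label and by the nature of its content rather than by a fixed location.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Word:
    """A recognized word and the position of its top-left corner (hypothetical units)."""
    text: str
    x: int
    y: int


def find_right_of(words: List[Word], label: str, pattern: str,
                  line_tolerance: int = 10) -> Optional[str]:
    """Return the nearest word to the right of `label` on the same line that matches `pattern`.

    Returns None when the label or a matching value is absent, which covers
    fields that are not present on every document.
    """
    anchors = [w for w in words if w.text.lower() == label.lower()]
    for anchor in anchors:
        candidates = [
            w for w in words
            if w.x > anchor.x                            # to the right of the label
            and abs(w.y - anchor.y) <= line_tolerance    # roughly on the same line
            and re.fullmatch(pattern, w.text)            # content has the expected form
        ]
        if candidates:
            return min(candidates, key=lambda w: w.x - anchor.x).text
    return None


# Two pages on which the same field appears in different places.
page_a = [Word("Total:", 50, 300), Word("149.90", 200, 302)]
page_b = [Word("Date:", 300, 100), Word("Total:", 300, 700), Word("75.00", 420, 698)]

amount = r"\d+\.\d{2}"
total_a = find_right_of(page_a, "Total:", amount)
total_b = find_right_of(page_b, "Total:", amount)
signature = find_right_of(page_a, "Signature:", r".+")
```

The same rule finds the amount on both pages even though its coordinates differ, and reports a missing field as `None` instead of failing.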

Formalized descriptions created by means of ABBYY FlexiLayout Studio are called FlexiLayouts. In order to start capturing data from unstructured documents by means of a FlexiLayout, you must export it into a data capture application such as ABBYY FlexiCapture. ABBYY FlexiCapture technology offers a wide range of data capture capabilities, enabling you to process practically any type of document.

12.04.2024 18:16:02
