Using NLP to process unstructured documents
Natural Language Processing (NLP) is a subfield of artificial intelligence and computational linguistics concerned with the computer analysis and synthesis of natural language. One practical application of NLP is extracting meaningful data from text.
The way a document is processed depends on its structure. For our purposes, we can distinguish three types of documents: structured, semi-structured, and unstructured documents.
- Structured documents contain a set of well defined data fields whose design, number, and placement do not change from one document to another. Examples of structured documents include forms, questionnaires, and applications.
- Semi-structured documents contain a set of data fields whose design, number, and placement can vary significantly from one document to another. They are also sometimes called "flexible documents." One example of semi-structured documents is invoices, where the number of entries and formatting often depends on the issuing company.
- Unstructured documents contain information that is not structured in any way and have no explicit data fields. Examples of unstructured documents include contracts, letters, and orders.
For more information about document types, see Types of documents processed using ABBYY FlexiCapture.
NLP technology should be used to process unstructured documents. For example, NLP can be used to extract the following types of data from a contract: reference numbers, names of parties, important dates (signing date, effective date, term, and termination date), contract price, fees, terms of payment, and so on.
To extract information from tables and from structured or semi-structured documents, use other methods (for example, FlexiLayouts).
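The guidance above amounts to a simple routing rule based on document type. The following is a minimal sketch of that rule in Python. The `DocumentType` enum, the `choose_extraction_method` function, and the returned labels are hypothetical names used for illustration only; they are not part of any ABBYY API.

```python
from enum import Enum


class DocumentType(Enum):
    """The three document types distinguished in this article."""
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


def choose_extraction_method(doc_type: DocumentType) -> str:
    """Return a label for the suggested extraction approach.

    The returned strings are illustrative labels, not product identifiers.
    """
    if doc_type is DocumentType.UNSTRUCTURED:
        return "NLP model"
    # Structured and semi-structured documents use layout-based methods.
    return "FlexiLayout"
```

For example, `choose_extraction_method(DocumentType.UNSTRUCTURED)` returns the NLP label, while the other two document types return the layout-based label.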
Extracting information from texts
ABBYY software products use NLP models to extract information from unstructured texts. An NLP model tells the program which entities to extract from a document. Training an NLP model on sample documents determines the subject area of your texts and the appropriate extraction algorithm, so that the information you need can be extracted more efficiently. The effort required to create an NLP model depends on the variety of your documents, the context available to the program, and the complexity and amount of the information that you need to extract.
Extracting data from unstructured texts is computationally intensive, and larger texts take longer to analyze.
However, the necessary information can often be found on a certain page or in a certain paragraph of a very large text. The process of finding such useful parts of a text is called segmentation. Segmentation requires considerably less time and computing resources than entity extraction, so it is sometimes worthwhile to segment a document before extracting information from it. For more information about identifying useful segments, see Creating a segmentation NLP model.
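The idea of segmenting before extracting can be illustrated with a minimal Python sketch. It is not how ABBYY products implement segmentation or extraction: here a keyword filter stands in for a segmentation model, and a date regular expression stands in for the costlier entity extraction step. The keywords, function names, and sample contract text are all assumptions made for illustration.

```python
import re

# Hypothetical keywords marking paragraphs likely to hold contract dates.
KEYWORDS = ("effective date", "signing date", "termination")

# Stand-in for the expensive entity extraction step: a simple date regex.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def segment(text: str) -> list[str]:
    """Cheap pass: keep only paragraphs that mention a keyword."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [p for p in paragraphs if any(k in p.lower() for k in KEYWORDS)]


def extract_dates(segments: list[str]) -> list[str]:
    """Costlier pass, run only on the selected segments."""
    dates = []
    for s in segments:
        dates.extend(DATE_PATTERN.findall(s))
    return dates


# Invented sample text with three paragraphs.
contract = (
    "This agreement is made between Party A and Party B.\n\n"
    "The effective date of this agreement is 1/15/2024.\n\n"
    "Invoices dated 3/1/2024 are handled separately."
)

useful = segment(contract)     # only the paragraph about the effective date
found = extract_dates(useful)  # dates found in that paragraph only
```

Because only one of the three paragraphs passes the cheap filter, the extraction step never sees the invoice date in the last paragraph. This shows both the benefit of segmentation (less text to analyze) and its risk (information outside the selected segments is skipped).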
To process unstructured documents using NLP, complete the following steps:
- Install the NLP module.
- Create a Document Definition.
- Create and train an NLP model, or load an existing NLP model into your Document Definition.
4/12/2024 6:16:02 PM