Sample 4. Step 4: Analyzing the images to determine the order in which the elements should be detected
At this stage, you have to establish the following:
- Is there any pattern or method in the arrangement of the fields on the images?
- Which elements can be relied on to look for the data fields?
- In what order should we look for the elements? (This is important, because at each subsequent step we can only rely on the elements of the previous step.)
As we are dealing with a multi-page document, we must first establish which objects can be used to identify the first and last pages of the document. These objects can be described by means of special compound Header and Footer elements.
- The Header element must match only the first page of the document.
If a project also contains documents of other types, this element will also be used as an identifier (i.e. a unique characteristic identifying this document type).
- The Footer element must match the last page of the document only. We recommend creating required subelements in this group to prevent the Footer element from being matched with any other document page.
Once you have analyzed the images, you will notice that:
- On the first page, there is a group of fields that consists of InvoiceNumber, InvoiceDate, and DeliveryAddress. The name of the InvoiceNumber field always appears at the beginning of each document, whereas InvoiceDate and DeliveryAddress are not always present.
- The InvoiceNumber and InvoiceDate fields can be located either to the right of their corresponding names or below them.
- For DeliveryAddress, we should also look either to the right of the corresponding name or below it, within a limited search area. Additionally, we will need an element that bounds the search area from below.
- Since these fields have no values on some of the images, you can speed up the matching process by specifying the following condition: do not look for the value of a field if the field's name has not been detected.
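The search logic described in the list above can be illustrated with a minimal Python sketch. It is not FlexiLayout code: the Rect type, the value_search_areas and find_value functions, and the find_text callback are hypothetical names introduced only for this illustration, and coordinates are assumed to be in pixels with the origin at the top-left corner of the page.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; hypothetical helper type for this sketch."""
    left: int
    top: int
    right: int
    bottom: int


def value_search_areas(name: Rect, page: Rect,
                       bottom_limit: Optional[int] = None) -> List[Rect]:
    """Return the two candidate areas for a field value:
    first to the right of the field name, then below it.
    bottom_limit (assumed to come from another detected element)
    bounds the lower area from below."""
    limit = page.bottom if bottom_limit is None else bottom_limit
    right_of = Rect(name.right, name.top, page.right, name.bottom)
    below = Rect(name.left, name.bottom, page.right, limit)
    return [right_of, below]


def find_value(name: Optional[Rect], page: Rect,
               find_text: Callable[[Rect], Optional[str]],
               bottom_limit: Optional[int] = None) -> Optional[str]:
    """Look for a value only if the field name was detected."""
    if name is None:
        # Name not detected: skip the value search entirely.
        return None
    for area in value_search_areas(name, page, bottom_limit):
        text = find_text(area)
        if text is not None:
            return text
    return None
```

In this sketch, passing name=None returns immediately without calling find_text, which mirrors the condition "do not look for the value if the name has not been detected". The bottom_limit argument plays the role of the element that restricts the DeliveryAddress search area from below.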
Note. The set of all pages of a document is called a multi-page canvas. A multi-page canvas is formed by joining all the pages of a document top-down, without any gaps, so that the left boundaries of all the pages lie on the same axis, which goes through the point (0, 0). The order in which the pages are joined together is determined by the order of the pages in the batch. Therefore, we can only specify the start of the table (its header on the first page) and the end of the table (its footer on the last page). The program will search for the table in the entire document, i.e. on the entire multi-page canvas.
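The canvas geometry described in the note can be made concrete with a short Python sketch. It only illustrates the coordinate arithmetic and is not part of the program: the function names page_offsets, to_canvas, and to_page are hypothetical, and page heights are assumed to be known in pixels and listed in batch order.

```python
from bisect import bisect_right
from typing import List, Tuple


def page_offsets(page_heights: List[int]) -> List[int]:
    """Canvas y-coordinate of the top edge of each page.
    Pages are stacked top-down without gaps, in batch order."""
    offsets, top = [], 0
    for height in page_heights:
        offsets.append(top)
        top += height
    return offsets


def to_canvas(page_index: int, x: int, y: int,
              page_heights: List[int]) -> Tuple[int, int]:
    """Convert page-local (x, y) to canvas coordinates.
    x is unchanged because every page's left edge lies on x = 0."""
    return x, page_offsets(page_heights)[page_index] + y


def to_page(canvas_y: int, page_heights: List[int]) -> Tuple[int, int]:
    """Convert a canvas y-coordinate to (page index, page-local y)."""
    if not 0 <= canvas_y < sum(page_heights):
        raise ValueError("canvas_y lies outside the multi-page canvas")
    offsets = page_offsets(page_heights)
    index = bisect_right(offsets, canvas_y) - 1
    return index, canvas_y - offsets[index]


# Example: three pages of 1000, 1000, and 500 pixels.
heights = [1000, 1000, 500]
point_on_canvas = to_canvas(1, 40, 25, heights)  # second page, local (40, 25)
```

Under these assumptions, a table that starts on the first page and ends on the last page occupies one continuous vertical range on the canvas, which is why only its start and end need to be specified.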
5/25/2023 7:55:03 AM