
Sample 4. Step 4: Analyzing the images to determine the order in which the elements should be detected

At this stage, you have to establish the following:

  • Is there any pattern or method in the arrangement of the fields on the images?
  • Which elements can be relied on to look for the data fields?
  • In what order should we look for the elements? (This is important, because at each subsequent step we can only rely on the elements of the previous step.)
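
The ordering constraint in the last question can be pictured as a dependency graph: an element can be searched for only after the elements it relies on. The sketch below is illustrative Python, not the program's own language or API. The names InvoiceHeader, InvoiceFooter, InvoiceTable and TotalAmount are the ones this step arrives at further down; ColumnNames and the dependency map itself are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each element lists the elements it relies on.
# "ColumnNames" is an illustrative name for the table column headings.
depends_on = {
    "InvoiceHeader": set(),                          # found first, no references
    "ColumnNames": {"InvoiceHeader"},                # searched for after the header
    "InvoiceFooter": {"ColumnNames"},                # column names restrict the footer search
    "InvoiceTable": {"ColumnNames", "InvoiceFooter"},  # table lies between the two
    "TotalAmount": {"InvoiceTable"},                 # located below the table
}


def search_order(deps):
    """Return an order in which every element comes after its references."""
    return list(TopologicalSorter(deps).static_order())


order = search_order(depends_on)
print(order)
```

If two elements were made to depend on each other, `static_order()` would raise `graphlib.CycleError`, which mirrors the rule that an element cannot rely on something detected later.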

As we are dealing with a multi-page document, we must first establish which objects can be used to identify the first and last pages of the document. These objects can be described by means of special compound Header and Footer elements.

  • The Header element must match only the first page of the document.

If a project also contains documents of other types, this element will also be used as an identifier (i.e. a unique characteristic identifying this document type).

  • The Footer element must match the last page of the document only. We recommend creating required subelements in this group to prevent the Footer element from being matched with any other document page.
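
To see why the Header must match only the first page and the Footer only the last page, consider a simplified model in which a batch of pages is split into documents using nothing but these two matches. This is a sketch in Python under that assumption, not a description of how the program actually assembles documents; the function name and the input format are hypothetical.

```python
def split_documents(pages):
    """Group batch pages into documents.

    pages: list of (has_header, has_footer) tuples in batch order.
    Returns a list of (first_index, last_index) pairs, both inclusive.
    """
    documents = []
    start = None  # index of the first page of the document being built
    for index, (has_header, has_footer) in enumerate(pages):
        if has_header:
            # A header match opens a new document; an unfinished one is dropped.
            start = index
        if has_footer and start is not None:
            # A footer match closes the document that is currently open.
            documents.append((start, index))
            start = None
    return documents


# A three-page invoice followed by a one-page invoice.
batch = [(True, False), (False, False), (False, True), (True, True)]
print(split_documents(batch))
```

In this model, a Footer that wrongly matched the second page would cut the first invoice short, which is the kind of error the required subelements are meant to prevent.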

Once you have analyzed the images, you will notice that:

  1. On the first page, there is a group of fields that consists of InvoiceNumber, InvoiceDate, and DeliveryAddress. The name of the InvoiceNumber field always appears at the beginning of each document, whereas InvoiceDate and DeliveryAddress are not always present.
    • The InvoiceNumber and InvoiceDate fields can be located either to the right of their corresponding names or below them.
    • For DeliveryAddress, we should also look either to the right of the corresponding name or below it, within a limited search area. Additionally, we will need an element to restrict the search area from below.
    • Since these fields have no values on some of the images, you can speed up the matching process by specifying the following condition: do not look for the value of a field if the field's name has not been detected.
  2. We can use this group of fields as an identifier for our document. We will describe these fields as part of a compound Header element named InvoiceHeader.
  3. The last page of the document has the words TOTAL AMOUNT MUST, Carried over, Total CHF, TOTAL below the table. However, these words may also occur elsewhere on the document (e.g. in the name or in the body of the table). To find these words, we will need to use additional reference elements (e.g. table column names). These reference elements will help us restrict the search area.
  4. The elements describing the last page of the document will be part of a compound Footer element named InvoiceFooter.
  5. For the Footer element to match only the last page of the document, it must contain a required element. As the words identifying the last page (see 3 above) occur on the last page of every document, we will make the element that describes them a required element.
  6. The table (name it InvoiceTable) starts on the first page and ends on the last one. Additionally, the table is always preceded by the column names on the first page. To identify the end of the table (on the last page), we will use an auxiliary element (e.g. the required element from the InvoiceFooter group).
    Note. The set of all the pages of a document is called a multi-page canvas. A multi-page canvas is formed by joining all the pages of a document top-down, without any gaps, with the left boundary of all the pages lying on the same axis, which goes through the point (0, 0). The order in which the pages are joined is determined by the order of the pages in the batch. Therefore, we can only specify the start of the table (its header on the first page) and the end of the table (its footer on the last page). The program will search for the table in the entire document, i.e. on the entire multi-page canvas.
  7. The Company field, which contains the name of the company, will always be searched for on the first page, in the upper third of the page.
  8. The name of the Total Amount field is always located on the last page, below the table. The value of the field is located either to the right of the name or below it.
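
The multi-page canvas described in the note to item 6 can be illustrated with a little coordinate arithmetic: because pages are stacked top-down with no gaps and share a left edge at x = 0, a point on page n keeps its x coordinate and has its y coordinate shifted by the combined height of the pages above it. The Python below is a minimal sketch of that arithmetic only. The function names and the page heights are made up for the example and are not part of the program.

```python
from itertools import accumulate


def page_offsets(page_heights):
    """Canvas y coordinate of the top edge of each page."""
    # The first page starts at 0; each later page starts where the previous ends.
    return [0] + list(accumulate(page_heights))[:-1]


def to_canvas(page_index, x, y, page_heights):
    """Convert a point on a page to multi-page canvas coordinates."""
    return x, page_offsets(page_heights)[page_index] + y


def to_page(x, canvas_y, page_heights):
    """Convert a canvas point back to (page_index, x, y on that page)."""
    offsets = page_offsets(page_heights)
    # Walk from the last page up until we find the page containing canvas_y.
    for index in range(len(offsets) - 1, -1, -1):
        if canvas_y >= offsets[index]:
            return index, x, canvas_y - offsets[index]
    raise ValueError("canvas_y lies above the first page")


heights = [3300, 3300, 2000]  # assumed page heights, in pixels
print(to_canvas(2, 100, 50, heights))
print(to_page(100, 6650, heights))
```

On such a canvas, a table that starts below the column names on the first page and ends above the footer on the last page is a single continuous vertical range, which is why only its start and end need to be specified.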

10/9/2020 8:50:41 AM

