Date search after low-quality recognition
This chapter describes several typical ways to create FlexiLayouts that look for date fields on low-quality images. Such images are fairly common, and their defects are mostly caused by wrong scanning settings. For instance, the image may be too bright or too dark if the brightness settings are not correct. As a result, some information from the image may be lost, or parts of the image may be noisy.
FlexiLayout Studio offers a special Date element that is used to detect dates. However, when creating a FlexiLayout you may find this element insufficient. This may happen when the date on the documents does not match any of the formats available in the Date element. For instance, FlexiLayout Studio allows the following languages to be used when specifying the month in the date field: English, Czech, Danish, Dutch, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish. If the month is spelled out as a word in any other language, the date will not correspond to any of the available formats, and you will not be able to detect it by means of a Date element.
Date search errors can also occur if the image contains elements which cannot be removed. For example, the date may be underlined with no visible gap between the date and the underline, it may be crossed out, or it may be written in character cells with black separators (frames, combs). The Date element may also fail because of speckles in the search area, or because the date is filled in by hand rather than typed.
One way to organize a search for a date field is described in the sample project SearchOfDate.fsp (folder %public%\ABBYY\FlexiCapture\12.0\Samples\FLS\Tips and Tricks\Date).
There are 5 pages in the project:
- Page 1 - date format can be described by means of the Date element;
- Page 2 - the month in the date is written in French;
- Page 3 - the date is underlined;
- Page 4 - the date field is noisy;
- Page 5 - the date is filled in by hand.
Let us try and find the date on all the images including those where the date format is not supported by the element of type Date.
All the elements describing the date field are joined into a DateGroup element. The first element in this group looks for the name of the date field. In our project, this is an element of type Static Text named DateHeader with a single value, 'Date:'.
Note. Setting up the search constraints for all the elements in the Relations section is not difficult and thus is not described here. You can look it up directly in the project.
Next we create an element of type Date named DateField. This element will search for date fields whose format is supported by the Date element.
As is seen in the project, the Date element can detect the date only on the first page.
To search for dates on the other pages, an element of type Character String is created. In our project this element is named DateAsString. All the characters that are likely to occur on the images are represented in this element by means of an alphabet.
Note. If the content of a date field is structured but its format is not supported by the Date element, it is advisable to describe this format with a regular expression instead of specifying an alphabet. However, you must be sure of the high quality of the processed images, because a regular expression presumes a 100% match between the field and the described structure. Alphabets, on the other hand, allow a certain percentage of errors, which is specified in the element's properties. Thus alphabets are a more flexible tool in cases where accuracy of recognition cannot be guaranteed. If you know that the month in the date will be written as a word in the same language as the pre-recognition language, it may be advisable to divide the date field into three sections (day, month, year) and to search for the month separately by using an element of type Static Text. Such a Static Text element should describe all possible variations of the month (e.g. full and abbreviated names of the month in the given language). The day and year fields will then be sought to the left and right of the month by means of elements of type Character String.
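The difference between a strict regular expression and a tolerant alphabet can be illustrated outside FlexiLayout Studio. The Python sketch below is only a conceptual model, not FlexiLayout code: the DD.MM.YYYY pattern, the alphabet, and the error thresholds are assumptions chosen for illustration and are not taken from the sample project.

```python
import re

# Assumed date format DD.MM.YYYY (hypothetical, for illustration only).
DATE_PATTERN = re.compile(r"\d{2}\.\d{2}\.\d{4}")

# Assumed alphabet: digits plus common date separators.
DATE_ALPHABET = set("0123456789./- ")


def matches_regex(text: str) -> bool:
    """Strict check: the whole string must fit the pattern exactly."""
    return DATE_PATTERN.fullmatch(text) is not None


def matches_alphabet(text: str, max_error_percent: float) -> bool:
    """Tolerant check: a share of characters may fall outside the alphabet."""
    if not text:
        return False
    errors = sum(1 for ch in text if ch not in DATE_ALPHABET)
    return 100.0 * errors / len(text) <= max_error_percent


clean = "12.03.2015"
noisy = "12.O3.2015"  # the digit 0 misrecognized as the letter O
```

In this model the noisy string fails the regular expression because of a single misrecognized character, while the alphabet check still accepts it as long as the allowed error percentage is at least 10.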
To optimize FlexiLayout matching, the following condition is provided in the Advanced pre-search relations field for the DateAsString element:
if (DateField.IsNull == FALSE) then Dontfind();
which is the same as:
if not DateField.IsNull then Dontfind();
This condition means that a search for the date as a character string will only be initiated if the date cannot be detected by using a Date element.
As is seen in the project, the element DateAsString finds the date on the rest of the project pages, where it could not be detected by the element of type Date.
On Page 4, however, the detected string contains only part of the date field. If you look at the pre-recognition results for the date field (by clicking Show Raw Objects on the toolbar), the reason for the partial detection becomes clear: the search area contains not only text objects, but other types of objects as well: Picture and Punctuation mark. Such a situation is typical for low-quality images - text objects are not always recognized during pre-recognition.
To find all the objects associated with the date field, an element of type Object Collection is created and named DateAsObjectCollection. All object types detected in the date field during pre-recognition are specified in the properties of the element. To optimize FlexiLayout matching, the following condition is provided in the Advanced pre-search relations field for the DateAsObjectCollection element, just as for the DateAsString element:
if (DateField.IsNull == FALSE) then Dontfind();
Note. The condition if (DateAsString.IsNull == FALSE) then Dontfind() cannot be added to the Advanced properties of the DateAsObjectCollection element, because, as is clear from the example, there may be situations when the detected string contains only a part of the date.
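The resulting search order can be summarized with a small model. The Python sketch below is not FlexiLayout code; the function name elements_to_search is hypothetical, and the sketch only mirrors the two pre-search conditions described above.

```python
from typing import List


def elements_to_search(date_field_found: bool) -> List[str]:
    """Model of the pre-search conditions: which fallback elements
    are still searched once the result of DateField is known."""
    searched = []
    if not date_field_found:
        # Mirrors: if (DateField.IsNull == FALSE) then Dontfind();
        searched.append("DateAsString")
        # DateAsObjectCollection depends only on DateField, never on
        # DateAsString, because the string may cover only part of the date.
        searched.append("DateAsObjectCollection")
    return searched
```

When the Date element succeeds, both fallback elements are skipped. Otherwise both are searched, regardless of how much of the date the character string manages to capture.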
At this stage the creation of elements describing the search constraints for the date fields may be considered complete. The Group element SearchElements.DateGroup.AlternativeDateGroup, which consists of the DateField, DateAsString, and DateAsObjectCollection elements, is specified in the project tree as a Source element for the Date block. Since the Dontfind() method was used to describe the properties of the DateAsString and DateAsObjectCollection elements, the actual region of the detected block will match either the region found by the hypothesis for the Date element or the combined regions of the DateAsObjectCollection and the DateAsString elements. In the latter case, we expect the region of the DateAsString element to be part of the region of the DateAsObjectCollection element, so the resulting region will be the region of the DateAsObjectCollection element.
Note. In this case we can specify the Group element SearchElements.DateGroup.AlternativeDateGroup as a Source element, because the situation is relatively simple. The region of the group is a combination of the regions of its detected subelements. The Dontfind() method allows skipping the search for some of the subelements. Thus the region of the Group element SearchElements.DateGroup.AlternativeDateGroup will match the region of whichever subelements were actually detected. In the given example, the Dontfind() method not only helps to optimize FlexiLayout matching, but also makes the description of blocks simpler.
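How the group region follows from its subelements can be modeled as follows. This Python sketch is a simplification under stated assumptions: each region is reduced to a single bounding rectangle (left, top, right, bottom), a skipped or undetected subelement is represented by None, and the name group_rect is hypothetical.

```python
from typing import List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom)


def group_rect(sub_rects: List[Optional[Rect]]) -> Optional[Rect]:
    """Combine the rectangles of the detected subelements.
    None stands for a subelement that was skipped or not found."""
    found = [r for r in sub_rects if r is not None]
    if not found:
        return None
    return (
        min(r[0] for r in found),
        min(r[1] for r in found),
        max(r[2] for r in found),
        max(r[3] for r in found),
    )


# Date element found, fallbacks skipped: the group equals DateField.
only_date = group_rect([(10, 10, 50, 20), None, None])

# Date element not found: the group covers string and collection together.
fallback = group_rect([None, (12, 10, 30, 20), (10, 8, 60, 22)])
```

In the second case the string rectangle lies inside the collection rectangle, so the combined result equals the rectangle of the object collection, which is the behaviour the text expects.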
Alternatively, we could use the code specified in the Expression section.
Rect outputRect;
let dateGroup = SearchElements.DateGroup.AlternativeDateGroup;
// Use the Date element if it was found; otherwise fall back to the object collection.
if (dateGroup.DateField.IsNull == FALSE) then
    outputRect = dateGroup.DateField.Rect;
else
    outputRect = dateGroup.DateAsObjectCollection.Rect;
OutputRegion = outputRect;
The use of an Expression provides additional options. For example, we could check whether the region of the DateAsString element is really a part of the region of the DateAsObjectCollection element.
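The logic of such a check can be sketched in Python. This is a conceptual model rather than FlexiLayout code: rectangles are assumed to be (left, top, right, bottom) tuples, None stands for an element that was not found, and falling back to the combined rectangle when the string sticks out of the collection is an assumption made for this example, not behaviour taken from the sample project.

```python
from typing import Optional, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom)


def contains(outer: Rect, inner: Rect) -> bool:
    """True if inner lies entirely within outer (edges may touch)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def bounding(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle that covers both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def choose_output_rect(date_field: Optional[Rect],
                       date_as_string: Optional[Rect],
                       date_as_collection: Optional[Rect]) -> Optional[Rect]:
    """Pick the block rectangle, verifying the containment assumption."""
    if date_field is not None:
        return date_field
    if date_as_collection is None:
        return date_as_string
    if date_as_string is None or contains(date_as_collection, date_as_string):
        return date_as_collection
    # Assumed fallback: the string extends beyond the collection,
    # so cover both rectangles instead of dropping part of the date.
    return bounding(date_as_collection, date_as_string)
```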
Note. Using a Character String element without defining the string format, or using an element of type Object Collection, as in the given example, can lead to good results if the search area of the date field can be clearly defined. But if there are several character strings in the search area, the string format should be described by means of a regular expression or a narrower alphabet. Otherwise, the final hypothesis may be unsatisfactory. With an element of type Character String, the user can limit the number of characters in a string, the number of word ends and the length of spaces in order to filter out wrong hypotheses. If an element of type Object Collection is used, the hypothesis will include all the objects on the image that are located within the search area and meet the constraints on the object size.
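The effect of such limits on candidate strings can be modeled roughly as follows. The Python sketch below is illustrative only: the function accept_hypothesis and its thresholds are hypothetical, and the length of a space is measured in characters here, which is a simplification of how distances are measured on an image.

```python
def accept_hypothesis(text: str, min_chars: int, max_chars: int,
                      max_space_run: int) -> bool:
    """Reject candidate strings that are too short, too long,
    or contain a gap wider than max_space_run characters."""
    stripped = text.strip()
    if not (min_chars <= len(stripped) <= max_chars):
        return False
    longest = 0
    current = 0
    for ch in stripped:
        if ch == " ":
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest <= max_space_run


candidates = [
    "12 March 2015",
    "Invoice number    12 March 2015 total",
    "12    03 2015",
]
# Keep only candidates of 6 to 20 characters with single spaces at most.
kept = [c for c in candidates if accept_hypothesis(c, 6, 20, 1)]
```

With these assumed limits only the first candidate survives: the second is too long, and the third contains a gap that is too wide.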