Analyzing XML Result Files
Example 1. Matching output pages with their respective input pages
The XML result file in this example was generated for a job that created two output documents from four input documents. By examining the file identifier FileId and the page identifier PageId, we can see that the first output file, file_1.docx, is made up of the pages of the input files file_1.pdf and file_2.pdf and the first page of the input file file_3.pdf. The second output file, file_3.docx, is made up of the second page of the input file file_3.pdf and the pages of the input file file_4.pdf.
For the sake of convenience, the files and their pages are shown in different colors in the figure below.
<XmlResult
Id="{BFF37808-4FA1-4FC3-949A-4BF7FC64FEC4}"
IsFailed="false"
...>
<ExportParams
DocumentSeparationMethod="SeparateByFixedNumberOfPages"
PagesPerDocument="3"
XMLResultPublishingMethod="XMLResultToFolder"
...>
<XMLResultLocation>D:\FRS\Workflow\Output Folder</XMLResultLocation>
...
</ExportParams>
<Name>file_1.pdf, file_2.pdf, file_3.pdf, file_4.pdf</Name>
<InputFile
FileName="file_1.pdf"
Id="21827"
...>
<Page
Id="{B01336B9-AE8B-479D-AE91-8478A7451E2A}"
PageNumber="0"
...>
...
</Page>
...
</InputFile>
<InputFile
FileName="file_2.pdf"
Id="21828"
...>
<Page
Id="{6F834F38-43D6-4FE5-A983-084986809143}"
PageNumber="0"
...>
...
</Page>
...
</InputFile>
<InputFile
FileName="file_3.pdf"
Id="21829"
...>
<Page
Id="{CE5113B2-0593-462D-A668-D0E6E92C21B0}"
PageNumber="0"
...>
...
</Page>
<Page
Id="{E97D68CE-FA78-44B7-8193-8A3A8DE3DB9C}"
PageNumber="1"
...>
...
</Page>
...
</InputFile>
<InputFile
FileName="file_4.pdf"
Id="21830"
...>
<Page
Id="{913CCBD2-AC7F-495D-B964-1CF6668B532A}"
PageNumber="0"
...>
...
</Page>
<Page
Id="{DC823DCD-9E71-412E-9690-1BD55B995736}"
PageNumber="1"
...>
...
</Page>
...
</InputFile>
<JobDocument
Name="file_1.pdf, file_2.pdf, file_3.pdf (page 1)"
Id="{75DDDD35-CD3D-4CE5-A671-AE31D8277538}"
...>
<OutputDocuments
OutputLocation="D:\FRS\Workflow\Output Folder"
ExportFormat="Docx"
...>
<FileName>file_1.docx</FileName>
</OutputDocuments>
<Pages>
<FileId>21827</FileId>
<PageId>{B01336B9-AE8B-479D-AE91-8478A7451E2A}</PageId>
</Pages>
<Pages>
<FileId>21828</FileId>
<PageId>{6F834F38-43D6-4FE5-A983-084986809143}</PageId>
</Pages>
<Pages>
<FileId>21829</FileId>
<PageId>{CE5113B2-0593-462D-A668-D0E6E92C21B0}</PageId>
</Pages>
...
</JobDocument>
<JobDocument
Name="file_3.pdf (page 2), file_4.pdf"
Id="{DC25C928-8070-4BF2-98B8-7EE608410E1D}"
...>
<OutputDocuments
OutputLocation="D:\FRS\Workflow\Output Folder"
ExportFormat="Docx"
...>
<FileName>file_3.docx</FileName>
</OutputDocuments>
<Pages>
<FileId>21829</FileId>
<PageId>{E97D68CE-FA78-44B7-8193-8A3A8DE3DB9C}</PageId>
</Pages>
<Pages>
<FileId>21830</FileId>
<PageId>{913CCBD2-AC7F-495D-B964-1CF6668B532A}</PageId>
</Pages>
<Pages>
<FileId>21830</FileId>
<PageId>{DC823DCD-9E71-412E-9690-1BD55B995736}</PageId>
</Pages>
...
</JobDocument>
...
</XmlResult>
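The matching described in this example can also be done programmatically. The following is a minimal Python sketch, not part of the product: it uses a trimmed copy of the listing above (everything elided with "..." is left out), the function name map_output_pages is hypothetical, and it assumes that PageNumber counts pages from zero, as the listing suggests.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the listing above; parts elided with "..." are omitted.
SAMPLE = """
<XmlResult Id="{BFF37808-4FA1-4FC3-949A-4BF7FC64FEC4}" IsFailed="false">
  <InputFile FileName="file_1.pdf" Id="21827">
    <Page Id="{B01336B9-AE8B-479D-AE91-8478A7451E2A}" PageNumber="0"/>
  </InputFile>
  <InputFile FileName="file_2.pdf" Id="21828">
    <Page Id="{6F834F38-43D6-4FE5-A983-084986809143}" PageNumber="0"/>
  </InputFile>
  <InputFile FileName="file_3.pdf" Id="21829">
    <Page Id="{CE5113B2-0593-462D-A668-D0E6E92C21B0}" PageNumber="0"/>
    <Page Id="{E97D68CE-FA78-44B7-8193-8A3A8DE3DB9C}" PageNumber="1"/>
  </InputFile>
  <InputFile FileName="file_4.pdf" Id="21830">
    <Page Id="{913CCBD2-AC7F-495D-B964-1CF6668B532A}" PageNumber="0"/>
    <Page Id="{DC823DCD-9E71-412E-9690-1BD55B995736}" PageNumber="1"/>
  </InputFile>
  <JobDocument Name="file_1.pdf, file_2.pdf, file_3.pdf (page 1)">
    <OutputDocuments ExportFormat="Docx"><FileName>file_1.docx</FileName></OutputDocuments>
    <Pages><FileId>21827</FileId><PageId>{B01336B9-AE8B-479D-AE91-8478A7451E2A}</PageId></Pages>
    <Pages><FileId>21828</FileId><PageId>{6F834F38-43D6-4FE5-A983-084986809143}</PageId></Pages>
    <Pages><FileId>21829</FileId><PageId>{CE5113B2-0593-462D-A668-D0E6E92C21B0}</PageId></Pages>
  </JobDocument>
  <JobDocument Name="file_3.pdf (page 2), file_4.pdf">
    <OutputDocuments ExportFormat="Docx"><FileName>file_3.docx</FileName></OutputDocuments>
    <Pages><FileId>21829</FileId><PageId>{E97D68CE-FA78-44B7-8193-8A3A8DE3DB9C}</PageId></Pages>
    <Pages><FileId>21830</FileId><PageId>{913CCBD2-AC7F-495D-B964-1CF6668B532A}</PageId></Pages>
    <Pages><FileId>21830</FileId><PageId>{DC823DCD-9E71-412E-9690-1BD55B995736}</PageId></Pages>
  </JobDocument>
</XmlResult>
"""


def map_output_pages(root):
    """Return {output file name: [(input file name, 1-based page number), ...]}."""
    # Index every input page by its (FileId, PageId) pair.
    page_index = {}
    for input_file in root.iter("InputFile"):
        for page in input_file.findall("Page"):
            key = (input_file.get("Id"), page.get("Id"))
            # Assumption: PageNumber is zero-based, so add 1 for display.
            page_index[key] = (input_file.get("FileName"), int(page.get("PageNumber")) + 1)

    result = {}
    for job_doc in root.iter("JobDocument"):
        # Resolve each <Pages> entry to the input page it refers to.
        sources = [
            page_index[(p.findtext("FileId"), p.findtext("PageId"))]
            for p in job_doc.findall("Pages")
        ]
        for name in job_doc.findall("OutputDocuments/FileName"):
            result[name.text] = sources
    return result


mapping = map_output_pages(ET.fromstring(SAMPLE.strip()))
for output_name, sources in mapping.items():
    print(output_name, "<-", sources)
```

Keying the index on both FileId and PageId mirrors the way each <Pages> tag in the listing identifies a page.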
Example 2. An XML result file obtained by copying an input file
The XML result file below was generated for an input file named file.png.
By examining the values of the Id and IsFailed attributes of the <XmlResult> tag, we can see that the job has a unique identifier and was executed successfully. Next, we examine the list of input and output files and see that for one input file, file.png, one output file was created, also named file.png. The name of the output file is given in the <OutputDocuments> tag, which is embedded in the <InputFile> tag. The extension ".png" does not match the extension specified in the mask, which means that the output file file.png was obtained by merely copying the input file file.png, rather than by performing OCR.
<XmlResult
Id="{070F0101-2625-46DB-AE99-B8C7FF48F3C3}"
IsFailed="false"
...>
<ExportParams
XMLResultPublishingMethod="XMLResultToFolder"
...>
<XMLResultLocation>D:\FRS\Workflow\Output Folder</XMLResultLocation>
...
</ExportParams>
<Name>file.png</Name>
<InputFile
FileName="file.png"
Id="21832"
...>
<OutputDocuments
OutputLocation="D:\FRS\Workflow\Output Folder"
ExportFormat="NoConversion"
...>
<FileName>file.png</FileName>
</OutputDocuments>
</InputFile>
...
</XmlResult>
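The check described in this example can be sketched in Python as follows. This is an illustration only: the sample is a trimmed copy of the listing above, the function name copied_files is hypothetical, and treating an <OutputDocuments> tag nested directly in <InputFile> with ExportFormat="NoConversion" as the sign of a plain copy is an assumption based on this listing.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the listing above; parts elided with "..." are omitted.
SAMPLE = r"""
<XmlResult Id="{070F0101-2625-46DB-AE99-B8C7FF48F3C3}" IsFailed="false">
  <Name>file.png</Name>
  <InputFile FileName="file.png" Id="21832">
    <OutputDocuments OutputLocation="D:\FRS\Workflow\Output Folder" ExportFormat="NoConversion">
      <FileName>file.png</FileName>
    </OutputDocuments>
  </InputFile>
</XmlResult>
"""


def copied_files(root):
    """Return (input name, output name, location) for files copied without OCR."""
    copies = []
    for input_file in root.iter("InputFile"):
        # Only look at <OutputDocuments> nested directly inside <InputFile>.
        for out in input_file.findall("OutputDocuments"):
            if out.get("ExportFormat") == "NoConversion":
                copies.append(
                    (input_file.get("FileName"), out.findtext("FileName"), out.get("OutputLocation"))
                )
    return copies


root = ET.fromstring(SAMPLE.strip())
job_ok = root.get("IsFailed") == "false"  # the attribute value is a string
print("Job succeeded:", job_ok)
print("Copied without conversion:", copied_files(root))
```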
Example 3. Processing files and comparing the XML result files in the Output folder and in the Exceptions folder
A 4-page file named Invoice.pdf had to be processed as follows:
- Recognize the file.
- Separate Invoice.pdf into documents of 2 pages each.
- Index the documents.
- Save each indexed document in DOC.
Note. During processing, the second document obtained by separating Invoice.pdf was rejected by an indexing operator because no invoice number was found in that document.
Next, we examine the XML result files obtained.
- In the Exceptions folder, we can find Invoice.pdf.result.xml.
<XmlResult
Id="{2481566F-AA4E-47D4-96FD-E97AB6DCE898}"
IsFailed="true"
...>
<ExportParams
DocumentSeparationMethod="SeparateByFixedNumberOfPages"
PagesPerDocument="2"
XMLResultPublishingMethod="XMLResultToFolder"
...>
<XMLResultLocation>D:\FRS\Workflow\Output Folder</XMLResultLocation>
...
</ExportParams>
<Name>Invoice.pdf</Name>
<InputFile
FileName="Invoice.pdf"
Id="21833"
...>
<OutputDocuments
OutputLocation="D:\FRS\Workflow\Exceptions Folder"
ExportFormat="NoConversion"
...>
<FileName>Invoice.pdf</FileName>
</OutputDocuments>
<Page
Id="{0B8C1DF4-3FF0-46A3-8CF4-67C7A94FD68E}"
PageNumber="0"
...>
...
</Page>
<Page
Id="{F95CCC65-D6C1-43AA-8714-F4520E912DE9}"
PageNumber="1"
...>
...
</Page>
<Page
Id="{F52BE173-B2C3-4D52-B05C-45A7847E1F5C}"
PageNumber="2"
...>
...
</Page>
<Page
Id="{1A606376-ACC1-4D14-8DE3-273777551C09}"
PageNumber="3"
...>
...
</Page>
...
</InputFile>
<JobDocument
Name="Invoice.pdf (pages 3,4)"
Id="{5D03D01F-7DB6-490A-BF18-441F2AF9533E}"
...>
<IsFailed>true</IsFailed>
<Message
Type="Error"
Code="35"
...>The document #2 was rejected by the Indexing Station operator ...</Message>
<Pages>
<FileId>21833</FileId>
<PageId>{F52BE173-B2C3-4D52-B05C-45A7847E1F5C}</PageId>
</Pages>
<Pages>
<FileId>21833</FileId>
<PageId>{1A606376-ACC1-4D14-8DE3-273777551C09}</PageId>
</Pages>
...
</JobDocument>
...
</XmlResult>
The value of the IsFailed attribute of the <XmlResult> tag tells us that the job failed. Next, we note the unique identifier of the job, given in the Id attribute of the same tag. This ID will help us find files that may have been processed successfully.
The <InputFile> tag tells us that the Input folder contains our Invoice.pdf, which has 4 pages. The <InputFile> tag contains an embedded <OutputDocuments> tag, which means that the input file was merely copied to the Output folder (as required by item (e) above).
The <FileName> tag and the OutputLocation attribute of the <OutputDocuments> tag point to the saved copy of the file.
The <FileId> and <PageId> tags inside the <Pages> tags of the <JobDocument> tag tell us that the document whose processing resulted in the error "The document #2 was rejected by the Indexing Station operator" consists of the third and fourth pages of the input file (PageNumber="2" and PageNumber="3", counting from zero).
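The reading of the failed job above can be sketched in Python. This is a minimal illustration under stated assumptions: the sample is a trimmed copy of the listing above, the function name failed_documents is hypothetical, and PageNumber is assumed to be zero-based.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Exceptions folder listing; elided parts are omitted.
SAMPLE = """
<XmlResult Id="{2481566F-AA4E-47D4-96FD-E97AB6DCE898}" IsFailed="true">
  <InputFile FileName="Invoice.pdf" Id="21833">
    <Page Id="{0B8C1DF4-3FF0-46A3-8CF4-67C7A94FD68E}" PageNumber="0"/>
    <Page Id="{F95CCC65-D6C1-43AA-8714-F4520E912DE9}" PageNumber="1"/>
    <Page Id="{F52BE173-B2C3-4D52-B05C-45A7847E1F5C}" PageNumber="2"/>
    <Page Id="{1A606376-ACC1-4D14-8DE3-273777551C09}" PageNumber="3"/>
  </InputFile>
  <JobDocument Name="Invoice.pdf (pages 3,4)" Id="{5D03D01F-7DB6-490A-BF18-441F2AF9533E}">
    <IsFailed>true</IsFailed>
    <Message Type="Error" Code="35">The document #2 was rejected by the Indexing Station operator ...</Message>
    <Pages><FileId>21833</FileId><PageId>{F52BE173-B2C3-4D52-B05C-45A7847E1F5C}</PageId></Pages>
    <Pages><FileId>21833</FileId><PageId>{1A606376-ACC1-4D14-8DE3-273777551C09}</PageId></Pages>
  </JobDocument>
</XmlResult>
"""


def failed_documents(root):
    """Return (document name, error codes, 1-based page numbers) per failed document."""
    # Assumption: PageNumber is zero-based, so add 1 for display.
    page_numbers = {
        (f.get("Id"), p.get("Id")): int(p.get("PageNumber")) + 1
        for f in root.iter("InputFile")
        for p in f.findall("Page")
    }
    report = []
    for doc in root.iter("JobDocument"):
        if doc.findtext("IsFailed") != "true":
            continue
        codes = [m.get("Code") for m in doc.findall("Message") if m.get("Type") == "Error"]
        pages = [
            page_numbers[(p.findtext("FileId"), p.findtext("PageId"))]
            for p in doc.findall("Pages")
        ]
        report.append((doc.get("Name"), codes, pages))
    return report


root = ET.fromstring(SAMPLE.strip())
print("Job", root.get("Id"), "failed:", root.get("IsFailed") == "true")
for name, codes, pages in failed_documents(root):
    print(name, "error codes:", codes, "pages:", pages)
```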
- In the Output folder, we can find Invoice.pdf.result.xml.
<XmlResult
Id="{2481566F-AA4E-47D4-96FD-E97AB6DCE898}"
IsFailed="false"
...>
<ExportParams
DocumentSeparationMethod="SeparateByFixedNumberOfPages"
PagesPerDocument="2"
XMLResultPublishingMethod="XMLResultToFolder"
...>
<XMLResultLocation>D:\FRS\Workflow\Output Folder</XMLResultLocation>
...
</ExportParams>
<Name>Invoice.pdf</Name>
<InputFile
FileName="Invoice.pdf"
Id="21833"
...>
<Page
Id="{0B8C1DF4-3FF0-46A3-8CF4-67C7A94FD68E}"
PageNumber="0"
...>
...
</Page>
<Page
Id="{F95CCC65-D6C1-43AA-8714-F4520E912DE9}"
PageNumber="1"
...>
...
</Page>
<Page
Id="{F52BE173-B2C3-4D52-B05C-45A7847E1F5C}"
PageNumber="2"
...>
...
</Page>
<Page
Id="{1A606376-ACC1-4D14-8DE3-273777551C09}"
PageNumber="3"
...>
...
</Page>
...
</InputFile>
<JobDocument
Name="Invoice.pdf (pages 1,2)"
Id="{FF05AABB-BBA4-49A7-AC71-2E2C8916BD72}"
...>
<IsFailed>false</IsFailed>
<Message
Type="Information"
Code="5"
...>The document #1 was accepted by the Indexing Station operator ...</Message>
<OutputDocuments
OutputLocation="D:\FRS\Workflow\Output Folder"
ExportFormat="Docx"
...>
<FileName>Invoice.docx</FileName>
</OutputDocuments>
<Pages>
<FileId>21833</FileId>
<PageId>{0B8C1DF4-3FF0-46A3-8CF4-67C7A94FD68E}</PageId>
</Pages>
<Pages>
<FileId>21833</FileId>
<PageId>{F95CCC65-D6C1-43AA-8714-F4520E912DE9}</PageId>
</Pages>
...
</JobDocument>
<JobDocument
Name="Invoice.pdf (pages 3,4)"
Id="{5D03D01F-7DB6-490A-BF18-441F2AF9533E}"
...>
<IsFailed>true</IsFailed>
<Message
Type="Error"
Code="35"
...>The document #2 was rejected by the Indexing Station operator ...</Message>
<Pages>
<FileId>21833</FileId>
<PageId>{F52BE173-B2C3-4D52-B05C-45A7847E1F5C}</PageId>
</Pages>
<Pages>
<FileId>21833</FileId>
<PageId>{1A606376-ACC1-4D14-8DE3-273777551C09}</PageId>
</Pages>
...
</JobDocument>
...
</XmlResult>
The value of the Id attribute of the <XmlResult> tag is the same as the value in Invoice.pdf.result.xml, found in the Exceptions folder. In other words, the job we are examining is a continuation of the job examined in step 1 above.
The value of the IsFailed attribute of the <XmlResult> tag tells us that this part of the job was executed successfully. The Id attribute of the <InputFile> tag tells us that the job involved the same 4-page Invoice.pdf file.
Next follows a set of <JobDocument> tags, each corresponding to an output document obtained by separating the input files into documents (as required by item (b) above).
In each <JobDocument>, the <OutputDocuments> tags contain the output files obtained by processing: Invoice.doc and Invoice001.pdf (as the Output folder already contained a file named Invoice.pdf, the program created a copy of the input file).
In each <JobDocument>, the <Pages> tags provide unique page identifiers to tell us from which pages of which input files the given output document was created. Thus, we can use the <JobDocument> tags to match the input files with their respective output files.
Note. The second <JobDocument> tag does not contain any <OutputDocuments> tags. Instead, it contains a <Message> tag of type Error with the text "The document #2 was rejected by the Indexing Station operator". This means that an error occurred when processing this document and the program placed an XML result file into the Exceptions folder.
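The comparison of the two result files can be sketched in Python as follows. This is an illustration only: both samples are trimmed copies of the listings in this example (attributes, pages, and messages not needed for the comparison are left out), and the function name document_outcomes is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two listings; only what the comparison needs is kept.
EXCEPTIONS_XML = """
<XmlResult Id="{2481566F-AA4E-47D4-96FD-E97AB6DCE898}" IsFailed="true">
  <JobDocument Name="Invoice.pdf (pages 3,4)" Id="{5D03D01F-7DB6-490A-BF18-441F2AF9533E}">
    <IsFailed>true</IsFailed>
  </JobDocument>
</XmlResult>
"""

OUTPUT_XML = """
<XmlResult Id="{2481566F-AA4E-47D4-96FD-E97AB6DCE898}" IsFailed="false">
  <JobDocument Name="Invoice.pdf (pages 1,2)" Id="{FF05AABB-BBA4-49A7-AC71-2E2C8916BD72}">
    <IsFailed>false</IsFailed>
    <OutputDocuments ExportFormat="Docx"><FileName>Invoice.docx</FileName></OutputDocuments>
  </JobDocument>
  <JobDocument Name="Invoice.pdf (pages 3,4)" Id="{5D03D01F-7DB6-490A-BF18-441F2AF9533E}">
    <IsFailed>true</IsFailed>
  </JobDocument>
</XmlResult>
"""


def document_outcomes(root):
    """Map each <JobDocument> Name to its output file names (empty list if none)."""
    return {
        doc.get("Name"): [n.text for n in doc.findall("OutputDocuments/FileName")]
        for doc in root.iter("JobDocument")
    }


exceptions_root = ET.fromstring(EXCEPTIONS_XML.strip())
output_root = ET.fromstring(OUTPUT_XML.strip())

# Both files describe the same job if the Id of <XmlResult> matches.
same_job = exceptions_root.get("Id") == output_root.get("Id")
outcomes = document_outcomes(output_root)
# Documents with no <OutputDocuments> are the ones to look for in the Exceptions folder.
without_output = [name for name, files in outcomes.items() if not files]

print("Same job:", same_job)
print("Outcomes:", outcomes)
print("Documents without output:", without_output)
```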
26.03.2024 13:49:49