Analyzing XML Result Files

Example 1. Matching output pages with their respective input pages

The XML result file in this example was generated for a job which created two output documents from four input files. By examining the file identifier FileId and the page identifier PageId, we can see that the first output file, file_1.docx, is made up of pages from the input files file_1.pdf and file_2.pdf, and from the first page of the input file file_3.pdf. The second output file, file_3.docx, is made up of the second page of the input file file_3.pdf and pages of the input file file_4.pdf.

For the sake of convenience, the files and their pages are shown in different colors in the figure below.

<XmlResult Id="{BFF37808-4FA1-4FC3-949A-4BF7FC64FEC4}" IsFailed="false" ...>
 <ExportParams DocumentSeparationMethod="SeparateByFixedNumberOfPages" PagesPerDocument="3" XMLResultPublishingMethod="XMLResultToFolder" ...>
   <XMLResultLocation>D:\FRS\Workflow\Output Folder</XMLResultLocation>
   ...
 </ExportParams>
 <Name>file_1.pdf, file_2.pdf, file_3.pdf, file_4.pdf</Name>
 <InputFile FileName="file_1.pdf" Id="21827" ...>
   <Page Id="{B01336B9-AE8B-479D-AE91-8478A7451E2A}" PageNumber="0" ...>
     ...
   </Page>
   ...
 </InputFile>
 <InputFile FileName="file_2.pdf" Id="21828" ...>
   <Page Id="{6F834F38-43D6-4FE5-A983-084986809143}" PageNumber="0" ...>
     ...
   </Page>
   ...
 </InputFile>
 <InputFile FileName="file_3.pdf" Id="21829" ...>
   <Page Id="{CE5113B2-0593-462D-A668-D0E6E92C21B0}" PageNumber="0" ...>
     ...
   </Page>
   <Page Id="{E97D68CE-FA78-44B7-8193-8A3A8DE3DB9C}" PageNumber="1" ...>
     ...
   </Page>
   ...
 </InputFile>
 <InputFile FileName="file_4.pdf" Id="21830" ...>
   <Page Id="{913CCBD2-AC7F-495D-B964-1CF6668B532A}" PageNumber="0" ...>
     ...
   </Page>
   <Page Id="{DC823DCD-9E71-412E-9690-1BD55B995736}" PageNumber="1" ...>
     ...
   </Page>
   ...
 </InputFile>
 <JobDocument Name="file_1.pdf, file_2.pdf, file_3.pdf (page 1)" Id="{75DDDD35-CD3D-4CE5-A671-AE31D8277538}" ...>
   <OutputDocuments OutputLocation="D:\FRS\Workflow\Output Folder" ExportFormat="Docx" ...>
     <FileName>file_1.docx</FileName>
   </OutputDocuments>
   <Pages>
     <FileId>21827</FileId>
     <PageId>{B01336B9-AE8B-479D-AE91-8478A7451E2A}</PageId>
   </Pages>
   <Pages>
     <FileId>21828</FileId>
     <PageId>{6F834F38-43D6-4FE5-A983-084986809143}</PageId>
   </Pages>
   <Pages>
     <FileId>21829</FileId>
     <PageId>{CE5113B2-0593-462D-A668-D0E6E92C21B0}</PageId>
   </Pages>
   ...
 </JobDocument>
 <JobDocument Name="file_3.pdf (page 2), file_4.pdf" Id="{DC25C928-8070-4BF2-98B8-7EE608410E1D}" ...>
   <OutputDocuments OutputLocation="D:\FRS\Workflow\Output Folder" ExportFormat="Docx" ...>
     <FileName>file_3.docx</FileName>
   </OutputDocuments>
   <Pages>
     <FileId>21829</FileId>
     <PageId>{E97D68CE-FA78-44B7-8193-8A3A8DE3DB9C}</PageId>
   </Pages>
   <Pages>
     <FileId>21830</FileId>
     <PageId>{913CCBD2-AC7F-495D-B964-1CF6668B532A}</PageId>
   </Pages>
   <Pages>
     <FileId>21830</FileId>
     <PageId>{DC823DCD-9E71-412E-9690-1BD55B995736}</PageId>
   </Pages>
   ...
 </JobDocument>
 ...
</XmlResult>
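
The matching described above can also be done programmatically. The following is a minimal Python sketch, not part of the product: the function name map_outputs_to_inputs is hypothetical, the sample is a trimmed copy of the listing with the elided parts ("...") dropped so that it is well-formed, and PageNumber is assumed to be zero-based, as the listing suggests.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the listing above. Elided parts ("...") are dropped so that
# the sample is well-formed; real result files carry more attributes.
SAMPLE = r"""
<XmlResult Id="{BFF37808-4FA1-4FC3-949A-4BF7FC64FEC4}" IsFailed="false">
  <InputFile FileName="file_1.pdf" Id="21827">
    <Page Id="{B01336B9-AE8B-479D-AE91-8478A7451E2A}" PageNumber="0"/>
  </InputFile>
  <InputFile FileName="file_2.pdf" Id="21828">
    <Page Id="{6F834F38-43D6-4FE5-A983-084986809143}" PageNumber="0"/>
  </InputFile>
  <InputFile FileName="file_3.pdf" Id="21829">
    <Page Id="{CE5113B2-0593-462D-A668-D0E6E92C21B0}" PageNumber="0"/>
    <Page Id="{E97D68CE-FA78-44B7-8193-8A3A8DE3DB9C}" PageNumber="1"/>
  </InputFile>
  <InputFile FileName="file_4.pdf" Id="21830">
    <Page Id="{913CCBD2-AC7F-495D-B964-1CF6668B532A}" PageNumber="0"/>
    <Page Id="{DC823DCD-9E71-412E-9690-1BD55B995736}" PageNumber="1"/>
  </InputFile>
  <JobDocument Id="{75DDDD35-CD3D-4CE5-A671-AE31D8277538}">
    <OutputDocuments ExportFormat="Docx"><FileName>file_1.docx</FileName></OutputDocuments>
    <Pages><FileId>21827</FileId><PageId>{B01336B9-AE8B-479D-AE91-8478A7451E2A}</PageId></Pages>
    <Pages><FileId>21828</FileId><PageId>{6F834F38-43D6-4FE5-A983-084986809143}</PageId></Pages>
    <Pages><FileId>21829</FileId><PageId>{CE5113B2-0593-462D-A668-D0E6E92C21B0}</PageId></Pages>
  </JobDocument>
  <JobDocument Id="{DC25C928-8070-4BF2-98B8-7EE608410E1D}">
    <OutputDocuments ExportFormat="Docx"><FileName>file_3.docx</FileName></OutputDocuments>
    <Pages><FileId>21829</FileId><PageId>{E97D68CE-FA78-44B7-8193-8A3A8DE3DB9C}</PageId></Pages>
    <Pages><FileId>21830</FileId><PageId>{913CCBD2-AC7F-495D-B964-1CF6668B532A}</PageId></Pages>
    <Pages><FileId>21830</FileId><PageId>{DC823DCD-9E71-412E-9690-1BD55B995736}</PageId></Pages>
  </JobDocument>
</XmlResult>
"""


def map_outputs_to_inputs(root):
    """Return {output file name: [(input file name, 1-based page number), ...]}."""
    # Index every input page by its (FileId, PageId) pair.
    pages = {}
    for input_file in root.iter("InputFile"):
        for page in input_file.findall("Page"):
            key = (input_file.get("Id"), page.get("Id"))
            # Assumption: PageNumber is zero-based, so add 1 for display.
            pages[key] = (input_file.get("FileName"), int(page.get("PageNumber")) + 1)

    result = {}
    for doc in root.iter("JobDocument"):
        # Each <Pages> element names one source page of this output document.
        sources = [pages[(p.findtext("FileId"), p.findtext("PageId"))]
                   for p in doc.findall("Pages")]
        for name in doc.findall("OutputDocuments/FileName"):
            result[name.text] = sources
    return result


mapping = map_outputs_to_inputs(ET.fromstring(SAMPLE.strip()))
for output_name, sources in mapping.items():
    print(output_name, "<-", sources)
```

The lookup uses the pair (FileId, PageId) rather than PageId alone, so it does not depend on page identifiers being unique across input files.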

Example 2. An XML result file obtained by copying an input file

The XML result file below was generated for an input file named file.png.

By examining the values of the Id and IsFailed attributes of the <XmlResult> tag, we can see that the job has a unique identifier and was executed successfully. Next, we examine the list of input and output files and see that for one input file, file.png, one output file was created, also named file.png. The name of the output file is given in the <OutputDocuments> tag, which is embedded in the <InputFile> tag. The extension ".png" does not fit the extension specified in the mask, which means that the output file file.png was obtained by merely copying the input file file.png, rather than by performing OCR. The value ExportFormat="NoConversion" is consistent with this.

<XmlResult Id="{070F0101-2625-46DB-AE99-B8C7FF48F3C3}" IsFailed="false" ...>
 <ExportParams XMLResultPublishingMethod="XMLResultToFolder" ...>
   <XMLResultLocation>D:\FRS\Workflow\Output Folder</XMLResultLocation>
   ...
 </ExportParams>
 <Name>file.png</Name>
 <InputFile FileName="file.png" Id="21832" ...>
   <OutputDocuments OutputLocation="D:\FRS\Workflow\Output Folder" ExportFormat="NoConversion" ...>
     <FileName>file.png</FileName>
   </OutputDocuments>
 </InputFile>
 ...
</XmlResult>
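
As a rough illustration of the check described in this example, the Python sketch below looks for <OutputDocuments> tags embedded directly in <InputFile> tags. The function name copied_files is hypothetical, and treating ExportFormat="NoConversion" as the marker of a copied file is an assumption based on the listing above, not a documented rule.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the listing above, with the elided parts ("...") dropped.
SAMPLE = r"""
<XmlResult Id="{070F0101-2625-46DB-AE99-B8C7FF48F3C3}" IsFailed="false">
  <Name>file.png</Name>
  <InputFile FileName="file.png" Id="21832">
    <OutputDocuments OutputLocation="D:\FRS\Workflow\Output Folder" ExportFormat="NoConversion">
      <FileName>file.png</FileName>
    </OutputDocuments>
  </InputFile>
</XmlResult>
"""


def copied_files(root):
    """List (input name, copied name) pairs for inputs that were copied as-is."""
    copies = []
    for input_file in root.iter("InputFile"):
        # Only <OutputDocuments> elements directly under <InputFile> are checked.
        for out in input_file.findall("OutputDocuments"):
            # Assumption: "NoConversion" marks a plain copy of the input file.
            if out.get("ExportFormat") == "NoConversion":
                copies.append((input_file.get("FileName"), out.findtext("FileName")))
    return copies


root = ET.fromstring(SAMPLE.strip())
job_ok = root.get("IsFailed") == "false"
copies = copied_files(root)
print("job succeeded:", job_ok)
print("copied without OCR:", copies)
```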

Example 3. Processing files and comparing the XML result files in the Output folder and in the Exceptions folder

A 4-page file named Invoice.pdf had to be processed as follows:

  1. Recognize the file.
  2. Separate Invoice.pdf into documents of 2 pages each.
  3. Index the documents.
  4. Save each indexed document in DOCX.

Note. During processing, the second document obtained by separating Invoice.pdf was rejected by an indexing operator because no invoice number was found in that document.

Next, we examine the XML result files obtained.

  1. In the Exceptions folder, we can find Invoices.pdf.result.xml.

<XmlResult Id="{2481566F-AA4E-47D4-96FD-E97AB6DCE898}" IsFailed="true" ...>
 <ExportParams DocumentSeparationMethod="SeparateByFixedNumberOfPages" PagesPerDocument="2" XMLResultPublishingMethod="XMLResultToFolder" ...>
   <XMLResultLocation>D:\FRS\Workflow\Output Folder</XMLResultLocation>
   ...
 </ExportParams>
 <Name>Invoice.pdf</Name>
 <InputFile FileName="Invoice.pdf" Id="21833" ...>
   <OutputDocuments OutputLocation="D:\FRS\Workflow\Exceptions Folder" ExportFormat="NoConversion" ...>
     <FileName>Invoice.pdf</FileName>
   </OutputDocuments>
   <Page Id="{0B8C1DF4-3FF0-46A3-8CF4-67C7A94FD68E}" PageNumber="0" ...>
     ...
   </Page>
   <Page Id="{F95CCC65-D6C1-43AA-8714-F4520E912DE9}" PageNumber="1" ...>
     ...
   </Page>
   <Page Id="{F52BE173-B2C3-4D52-B05C-45A7847E1F5C}" PageNumber="2" ...>
     ...
   </Page>
   <Page Id="{1A606376-ACC1-4D14-8DE3-273777551C09}" PageNumber="3" ...>
     ...
   </Page>
   ...
 </InputFile>
 <JobDocument Name="Invoice.pdf (pages 3,4)" Id="{5D03D01F-7DB6-490A-BF18-441F2AF9533E}" ...>
   <IsFailed>true</IsFailed>
   <Message Type="Error" Code="35" ...>The document #2 was rejected by the Indexing Station operator ...</Message>
   <Pages>
     <FileId>21833</FileId>
     <PageId>{F52BE173-B2C3-4D52-B05C-45A7847E1F5C}</PageId>
   </Pages>
   <Pages>
     <FileId>21833</FileId>
     <PageId>{1A606376-ACC1-4D14-8DE3-273777551C09}</PageId>
   </Pages>
   ...
 </JobDocument>
 ...
</XmlResult>

The value of the IsFailed attribute of the <XmlResult> tag tells us that the job failed. Now we need to find the unique identifier of the job. This ID will help us find files which may have been processed successfully.

The <InputFile> tag tells us that the input file is our Invoice.pdf, which has 4 pages. The <InputFile> tag contains an embedded <OutputDocuments> tag, which means that the input file was merely copied, without conversion, to the folder given in the OutputLocation attribute (here, the Exceptions folder).

The <FileName> tag and the OutputLocation attribute point to the saved copy of the file.

The <FileId> and <PageId> tags inside the <JobDocument> tag tell us that the document whose processing resulted in the error "The document #2 was rejected by the Indexing Station operator" consists of the third and fourth pages of the input file.
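
The same reading of the Exceptions result file can be sketched in Python. This is only an illustration: the function name failed_documents is hypothetical, the sample is a trimmed copy of the listing above, and PageNumber is assumed to be zero-based.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Exceptions listing above, with elided parts dropped.
SAMPLE = r"""
<XmlResult Id="{2481566F-AA4E-47D4-96FD-E97AB6DCE898}" IsFailed="true">
  <InputFile FileName="Invoice.pdf" Id="21833">
    <Page Id="{0B8C1DF4-3FF0-46A3-8CF4-67C7A94FD68E}" PageNumber="0"/>
    <Page Id="{F95CCC65-D6C1-43AA-8714-F4520E912DE9}" PageNumber="1"/>
    <Page Id="{F52BE173-B2C3-4D52-B05C-45A7847E1F5C}" PageNumber="2"/>
    <Page Id="{1A606376-ACC1-4D14-8DE3-273777551C09}" PageNumber="3"/>
  </InputFile>
  <JobDocument Name="Invoice.pdf (pages 3,4)" Id="{5D03D01F-7DB6-490A-BF18-441F2AF9533E}">
    <IsFailed>true</IsFailed>
    <Message Type="Error" Code="35">The document #2 was rejected by the Indexing Station operator ...</Message>
    <Pages><FileId>21833</FileId><PageId>{F52BE173-B2C3-4D52-B05C-45A7847E1F5C}</PageId></Pages>
    <Pages><FileId>21833</FileId><PageId>{1A606376-ACC1-4D14-8DE3-273777551C09}</PageId></Pages>
  </JobDocument>
</XmlResult>
"""


def failed_documents(root):
    """Return (document name, 1-based page numbers, message code) per failed document."""
    # Assumption: PageNumber is zero-based, so add 1 for display.
    page_numbers = {
        page.get("Id"): int(page.get("PageNumber")) + 1
        for input_file in root.iter("InputFile")
        for page in input_file.findall("Page")
    }
    failures = []
    for doc in root.iter("JobDocument"):
        if doc.findtext("IsFailed") == "true":
            pages = [page_numbers[p.findtext("PageId")] for p in doc.findall("Pages")]
            message = doc.find("Message")
            code = message.get("Code") if message is not None else None
            failures.append((doc.get("Name"), pages, code))
    return failures


root = ET.fromstring(SAMPLE.strip())
job_id = root.get("Id")  # needed later to find the matching Output result file
failures = failed_documents(root)
print(job_id, failures)
```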

  2. In the Output folder, we can find Invoices.pdf.result.xml.

<XmlResult Id="{2481566F-AA4E-47D4-96FD-E97AB6DCE898}" IsFailed="false" ...>
 <ExportParams DocumentSeparationMethod="SeparateByFixedNumberOfPages" PagesPerDocument="2" XMLResultPublishingMethod="XMLResultToFolder" ...>
   <XMLResultLocation>D:\FRS\Workflow\Output Folder</XMLResultLocation>
   ...
 </ExportParams>
 <Name>Invoice.pdf</Name>
 <InputFile FileName="Invoice.pdf" Id="21833" ...>
   <Page Id="{0B8C1DF4-3FF0-46A3-8CF4-67C7A94FD68E}" PageNumber="0" ...>
     ...
   </Page>
   <Page Id="{F95CCC65-D6C1-43AA-8714-F4520E912DE9}" PageNumber="1" ...>
     ...
   </Page>
   <Page Id="{F52BE173-B2C3-4D52-B05C-45A7847E1F5C}" PageNumber="2" ...>
     ...
   </Page>
   <Page Id="{1A606376-ACC1-4D14-8DE3-273777551C09}" PageNumber="3" ...>
     ...
   </Page>
   ...
 </InputFile>
 <JobDocument Name="Invoice.pdf (pages 1,2)" Id="{FF05AABB-BBA4-49A7-AC71-2E2C8916BD72}" ...>
   <IsFailed>false</IsFailed>
   <Message Type="Information" Code="5" ...>The document #1 was accepted by the Indexing Station operator ...</Message>
   <OutputDocuments OutputLocation="D:\FRS\Workflow\Output Folder" ExportFormat="Docx" ...>
     <FileName>Invoice.docx</FileName>
   </OutputDocuments>
   <Pages>
     <FileId>21833</FileId>
     <PageId>{0B8C1DF4-3FF0-46A3-8CF4-67C7A94FD68E}</PageId>
   </Pages>
   <Pages>
     <FileId>21833</FileId>
     <PageId>{F95CCC65-D6C1-43AA-8714-F4520E912DE9}</PageId>
   </Pages>
   ...
 </JobDocument>
 <JobDocument Name="Invoice.pdf (pages 3,4)" Id="{5D03D01F-7DB6-490A-BF18-441F2AF9533E}" ...>
   <IsFailed>true</IsFailed>
   <Message Type="Error" Code="35" ...>The document #2 was rejected by the Indexing Station operator ...</Message>
   <Pages>
     <FileId>21833</FileId>
     <PageId>{F52BE173-B2C3-4D52-B05C-45A7847E1F5C}</PageId>
   </Pages>
   <Pages>
     <FileId>21833</FileId>
     <PageId>{1A606376-ACC1-4D14-8DE3-273777551C09}</PageId>
   </Pages>
   ...
 </JobDocument>
 ...
</XmlResult>

The value of the Id attribute of the <XmlResult> tag is the same as the value in Invoices.pdf.result.xml, found in the Exceptions folder. In other words, the job we are examining is a continuation of the job examined in step 1 above.
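
A comparison of this kind can be sketched in Python as follows. The two samples keep only the attributes and tags needed for the comparison, and the function name failed_document_ids is hypothetical; treat it as an illustration under these assumptions rather than as a supported tool.

```python
import xml.etree.ElementTree as ET

# Minimal stand-ins for the two result files; everything not needed for the
# comparison is dropped from the listings above.
EXCEPTIONS_XML = r"""
<XmlResult Id="{2481566F-AA4E-47D4-96FD-E97AB6DCE898}" IsFailed="true">
  <JobDocument Id="{5D03D01F-7DB6-490A-BF18-441F2AF9533E}">
    <IsFailed>true</IsFailed>
  </JobDocument>
</XmlResult>
"""

OUTPUT_XML = r"""
<XmlResult Id="{2481566F-AA4E-47D4-96FD-E97AB6DCE898}" IsFailed="false">
  <JobDocument Id="{FF05AABB-BBA4-49A7-AC71-2E2C8916BD72}">
    <IsFailed>false</IsFailed>
  </JobDocument>
  <JobDocument Id="{5D03D01F-7DB6-490A-BF18-441F2AF9533E}">
    <IsFailed>true</IsFailed>
  </JobDocument>
</XmlResult>
"""


def failed_document_ids(root):
    """Collect the Id of every <JobDocument> whose <IsFailed> is true."""
    return {doc.get("Id") for doc in root.iter("JobDocument")
            if doc.findtext("IsFailed") == "true"}


exceptions_root = ET.fromstring(EXCEPTIONS_XML.strip())
output_root = ET.fromstring(OUTPUT_XML.strip())

# Both files describe the same job if the <XmlResult> Id values are equal.
same_job = exceptions_root.get("Id") == output_root.get("Id")

# Documents reported in the Exceptions file that are also marked as failed
# in the Output file.
failed_in_both = failed_document_ids(exceptions_root) & failed_document_ids(output_root)

print("same job:", same_job)
print("failed in both files:", sorted(failed_in_both))
```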

The value of the IsFailed attribute of the <XmlResult> tag tells us that this part of the job was executed successfully. The Id attribute of the <InputFile> tag tells us that the job involved the same 4-page Invoice.pdf file.

Next follows a set of <JobDocument> tags, each corresponding to an output document obtained by separating the input file into documents (as required by step 2 above).

In each <JobDocument>, the <OutputDocuments> tags contain the output files obtained by processing: Invoice.docx and Invoice001.pdf (as the Output folder already contained a file named Invoice.pdf, the program created a copy of the input file).

In each <JobDocument>, the <Pages> tags provide unique page identifiers to tell us from which pages of which input files the given output document was created. Thus, we can use the <JobDocument> tags to match the input files with their respective output files.

Note. The second <JobDocument> tag does not contain any <OutputDocuments> tags. Instead, it contains a <Message> tag of type Error stating that the document was rejected by the Indexing Station operator. This means that an error occurred when processing this document and the program placed an XML result file into the Exceptions folder.

See also

XML Result

29.08.2023 11:55:30
