Document Transformation System from Papers to XML Data Based on Pivot XML Document Method

  • Authors:
  • Yasuto Ishitani

  • Venue:
  • ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
  • Year:
  • 2003

Abstract

This paper proposes a new method for document transformation using OCR to generate various XML documents from printed documents. The proposed method adopts a hierarchical transformation strategy based on a pivot XML document. Firstly, document elements such as title, authors, abstract, headings, paragraphs, lists, captions, tables and figures are extracted from document images. Secondly, the hierarchical structure of document elements is extracted and is described using a DOM tree. Thirdly, this document structure is converted into a pivot XML document described as an XHTML document by an XML parser. Finally, this pivot XML document is transformed into the target XML document by the XML parser with XSLT scripts or specific programs. Experimental results show the method is effective in transforming printed documents to various XML documents.
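
The last two stages of the pipeline described in the abstract (structure to pivot XHTML, pivot to target XML) can be illustrated with a minimal sketch. This is not the paper's implementation. The input list, the tag mapping, the `class` attribute convention, and the `<article>` target schema are all hypothetical, and a small Python function stands in for the "specific programs" alternative to XSLT.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of the OCR/layout-analysis stage: (element type, text).
ELEMENTS = [
    ("title", "A Sample Paper"),
    ("authors", "A. Author"),
    ("abstract", "This is the abstract."),
    ("heading", "1. Introduction"),
    ("paragraph", "First paragraph."),
    ("paragraph", "Second paragraph."),
    ("heading", "2. Method"),
    ("paragraph", "Third paragraph."),
]

# Assumed mapping from element types to XHTML tags in the pivot document.
PIVOT_TAGS = {"title": "h1", "heading": "h2"}


def build_pivot(elements):
    """Build an XHTML-like pivot tree; each node keeps its type in 'class'."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    for kind, text in elements:
        node = ET.SubElement(body, PIVOT_TAGS.get(kind, "p"), {"class": kind})
        node.text = text
    return html


def pivot_to_target(pivot):
    """Transform the pivot tree into a hypothetical <article> target schema."""
    article = ET.Element("article")
    section = None
    for node in pivot.find("body"):
        kind = node.get("class")
        if kind == "heading":
            # Each heading opens a new section in the target document.
            section = ET.SubElement(article, "section", {"title": node.text})
        elif kind == "paragraph":
            # Paragraphs attach to the current section, if one is open.
            parent = section if section is not None else article
            ET.SubElement(parent, "para").text = node.text
        else:
            # Front matter (title, authors, abstract) maps to same-named tags.
            ET.SubElement(article, kind).text = node.text
    return article


pivot = build_pivot(ELEMENTS)
target = pivot_to_target(pivot)
print(ET.tostring(target, encoding="unicode"))
```

Because every target format is derived from the same pivot tree, supporting another output schema in this sketch would only require a second function like `pivot_to_target`, leaving the extraction and pivot-building steps unchanged.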