Document Transformation System from Papers to XML Data Based on Pivot XML Document Method

Authors:
Yasuto Ishitani
Affiliations:
-
Venue:
ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
Year:
2003

Citing 2
Cited 10

Syntactic Segmentation and Labeling of Digitized Pages from Technical Journals
IEEE Transactions on Pattern Analysis and Machine Intelligence

Logical Structure Analysis of Document Images Based on Emergent Computation
ICDAR '99 Proceedings of the Fifth International Conference on Document Analysis and Recognition

Robust document image understanding technologies
Proceedings of the 1st ACM workshop on Hardcopy document processing

Structuring documents according to their table of contents
Proceedings of the 2005 ACM symposium on Document engineering

Optimized XY-Cut for Determining a Page Reading Order
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition

Table Structure Analysis Based on Cell Classification and Cell Modification for XML Document Transformation
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition

Framework for version control & dependency link of components & products in software product line
AIC'04 Proceedings of the 4th WSEAS International Conference on Applied Informatics and Communications

A probabilistic learning method for XML annotation of documents
IJCAI'05 Proceedings of the 19th international joint conference on Artificial intelligence

Learning to order: a relational approach
MCD'07 Proceedings of the 3rd ECML/PKDD international conference on Mining complex data

From layout to semantic: a reranking model for mapping web documents to mediated XML representations
Large Scale Semantic Access to Content (Text, Image, Video, and Sound)

Structure extraction from PDF-based book documents
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries

From legacy documents to XML: a conversion framework
ECDL'05 Proceedings of the 9th European conference on Research and Advanced Technology for Digital Libraries

Abstract

This paper proposes a new method for document transformation using OCR to generate various XML documents from printed documents. The proposed method adopts a hierarchical transformation strategy based on a pivot XML document. Firstly, document elements such as title, authors, abstract, headings, paragraphs, lists, captions, tables and figures are extracted from document images. Secondly, the hierarchical structure of document elements is extracted and is described using a DOM tree. Thirdly, this document structure is converted into a pivot XML document described as an XHTML document by an XML parser. Finally, this pivot XML document is transformed into the target XML document by the XML parser with XSLT scripts or specific programs. Experimental results show the method is effective in transforming printed documents to various XML documents.
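
The pivot-document idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the input list stands in for the output of the OCR and layout analysis stage, the tag mapping `PIVOT_TAGS`, the functions `build_pivot` and `pivot_to_target`, and the target vocabulary (`article`, `section`, `para`) are all hypothetical, and the XHTML namespace is omitted for brevity. The second step plays the role of the "specific program" mentioned in the abstract; an XSLT script could fill the same role.

```python
import xml.etree.ElementTree as ET

# Hypothetical result of the extraction stage: (element type, text) in reading order.
elements = [
    ("title", "A Sample Paper"),
    ("authors", "A. Author"),
    ("abstract", "This is the abstract."),
    ("heading", "1. Introduction"),
    ("paragraph", "First paragraph."),
    ("heading", "2. Method"),
    ("paragraph", "Second paragraph."),
]

# Assumed mapping from element types to XHTML-like tags in the pivot document.
PIVOT_TAGS = {"title": "h1", "heading": "h2", "paragraph": "p"}


def build_pivot(items):
    """Build a flat XHTML-like pivot tree from the extracted elements."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    for kind, text in items:
        node = ET.SubElement(body, PIVOT_TAGS.get(kind, "p"))
        if kind not in PIVOT_TAGS:
            # Keep the original element type so later stages can recover it.
            node.set("class", kind)
        node.text = text
    return html


def pivot_to_target(pivot):
    """Convert the pivot tree into a nested, hypothetical target vocabulary."""
    article = ET.Element("article")
    section = None
    for node in pivot.find("body"):
        if node.tag == "h1":
            ET.SubElement(article, "title").text = node.text
        elif node.tag == "h2":
            # Each heading opens a new section that collects following paragraphs.
            section = ET.SubElement(article, "section")
            ET.SubElement(section, "heading").text = node.text
        elif node.get("class") in ("authors", "abstract"):
            ET.SubElement(article, node.get("class")).text = node.text
        else:
            parent = section if section is not None else article
            ET.SubElement(parent, "para").text = node.text
    return article


pivot = build_pivot(elements)
target = pivot_to_target(pivot)
pivot_xml = ET.tostring(pivot, encoding="unicode")
target_xml = ET.tostring(target, encoding="unicode")
```

Because every target format is derived from the same pivot tree, supporting another target in this sketch would only require another function like `pivot_to_target`, while the extraction and pivot-building steps stay unchanged.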