This paper proposes a new method for document transformation using OCR to generate various XML documents from printed documents. The proposed method adopts a hierarchical transformation strategy based on a pivot XML document. Firstly, document elements such as title, authors, abstract, headings, paragraphs, lists, captions, tables and figures are extracted from document images. Secondly, the hierarchical structure of document elements is extracted and is described using a DOM tree. Thirdly, this document structure is converted into a pivot XML document described as an XHTML document by an XML parser. Finally, this pivot XML document is transformed into the target XML document by the XML parser with XSLT scripts or specific programs. Experimental results show the method is effective in transforming printed documents to various XML documents.
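The last two stages of the pipeline (element tree to pivot XHTML, then pivot to target XML) can be illustrated with a minimal sketch. It is not the authors' implementation. The input list `elements`, the label-to-tag mapping `XHTML_TAG`, the function names, and the target schema (`article`, `section`, `heading`, `para`) are all assumptions made for illustration. The final step is written as a small Python routine, standing in for the "specific programs" option rather than an XSLT script.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of the extraction step: (label, text) pairs in reading order.
elements = [
    ("title", "A Sample Paper"),
    ("authors", "A. Author"),
    ("abstract", "Short summary."),
    ("heading", "1 Introduction"),
    ("paragraph", "First paragraph."),
    ("heading", "2 Method"),
    ("paragraph", "Second paragraph."),
]

# Assumed mapping from element labels to XHTML tags; other labels become <div class="...">.
XHTML_TAG = {"title": "h1", "heading": "h2", "paragraph": "p"}


def build_pivot(labeled_elements):
    """Build a pivot XHTML tree from the labeled document elements."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    for label, text in labeled_elements:
        node = ET.SubElement(body, XHTML_TAG.get(label, "div"))
        if label not in XHTML_TAG:
            node.set("class", label)
        node.text = text
    return html


def pivot_to_target(pivot):
    """Transform the pivot tree into a hypothetical target schema.

    Each <h2> opens a new <section>; following <p> nodes are nested inside it.
    """
    article = ET.Element("article")
    section = None
    for node in pivot.find("body"):
        if node.tag == "h1":
            ET.SubElement(article, "title").text = node.text
        elif node.tag == "h2":
            section = ET.SubElement(article, "section")
            ET.SubElement(section, "heading").text = node.text
        elif node.tag == "p":
            parent = section if section is not None else article
            ET.SubElement(parent, "para").text = node.text
        else:
            # <div class="authors"> and similar map to an element named after the class.
            ET.SubElement(article, node.get("class")).text = node.text
    return article


pivot = build_pivot(elements)
target = pivot_to_target(pivot)
print(ET.tostring(target, encoding="unicode"))
```

Keeping the pivot separate from the target means that supporting another output format only requires a new `pivot_to_target`-style routine, while the extraction and pivot-building steps stay unchanged.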