Despite the advancement of XML, the majority of documents on the Web are still marked up with HTML for visual rendering purposes only, resulting in a huge amount of "legacy" data. To query Web-based data more efficiently and effectively than keyword-based retrieval allows, such Web documents must be enriched with both structure and semantics.

This paper describes a novel approach to the integration of topic-specific HTML documents into a repository of XML documents. In particular, we describe how topic-specific HTML documents are transformed into XML documents. The proposed document transformation and semantic element tagging process uses document restructuring rules and minimal information about the topic in the form of concepts. For the resulting XML documents, a majority schema is derived that describes the structures common among the documents in the form of a DTD.

We explore and discuss different techniques and rules for document conversion and majority schema discovery. Finally, we demonstrate the feasibility and effectiveness of our approach by applying it to a set of resume HTML documents gathered by a Web crawler.
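
To make the idea of a majority schema more concrete, the following is a minimal sketch, not the algorithm used in the paper. It assumes that a parent/child element pair belongs to the majority schema when it occurs in more than a chosen fraction of the XML documents. The function names `majority_schema` and `to_dtd`, the `threshold` parameter, and the sample resume documents are all hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def majority_schema(xml_docs, threshold=0.5):
    """Return {parent_tag: [child_tag, ...]} for parent/child pairs that
    occur in more than `threshold` of the documents (assumed criterion)."""
    support = defaultdict(set)  # (parent, child) -> indexes of supporting docs
    for index, text in enumerate(xml_docs):
        root = ET.fromstring(text)
        for parent in root.iter():
            for child in parent:
                support[(parent.tag, child.tag)].add(index)

    schema = defaultdict(list)
    for (parent, child), docs in sorted(support.items()):
        if len(docs) / len(xml_docs) > threshold:
            schema[parent].append(child)
    return dict(schema)


def to_dtd(schema):
    """Render the schema as simplified DTD element declarations.
    Every child is marked optional and repeatable ('*') for simplicity."""
    lines = []
    for parent in sorted(schema):
        children = ", ".join(f"{child}*" for child in schema[parent])
        lines.append(f"<!ELEMENT {parent} ({children})>")
    return "\n".join(lines)


# Hypothetical converted resume documents, for illustration only.
docs = [
    "<resume><education><degree/></education><skills/></resume>",
    "<resume><education><degree/><institution/></education></resume>",
    "<resume><education/><hobbies/></resume>",
]

if __name__ == "__main__":
    print(to_dtd(majority_schema(docs)))
```

In this sketch, structures that appear in only one of the three sample documents (such as `skills` or `hobbies`) are dropped, while `education` under `resume` and `degree` under `education` are kept. A realistic implementation would also need to infer ordering, cardinality, and text content, which this sketch ignores.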