Reverse Engineering for Web Data: From Visual to Semantic Structures

Authors:
Affiliations:
Venue: ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Year: 2002

Citing 0
Cited 15

Datarover: a taxonomy based crawler for automated data extraction from data-intensive websites
WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management

Hearsay: enabling audio browsing on hypertext content
Proceedings of the 13th international conference on World Wide Web

OntoMiner: bootstrapping ontologies from overlapping domain specific web sites
Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters

Supervised learning for the legacy document conversion
Proceedings of the 2004 ACM symposium on Document engineering

WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model
IEEE Transactions on Knowledge and Data Engineering

Bootstrapping Semantic Annotation for Content-Rich HTML Documents
ICDE '05 Proceedings of the 21st International Conference on Data Engineering

Browsing fatigue in handhelds: semantic bookmarking spells relief
WWW '05 Proceedings of the 14th international conference on World Wide Web

BlackBoardNV: a system for enabling non-visual access to the blackboard course management system
Proceedings of the 7th international ACM SIGACCESS conference on Computers and accessibility

OntoMiner: automated metadata and instance mining from news websites
International Journal of Web and Grid Services

Automated Semantic Analysis of Schematic Data
World Wide Web

A probabilistic learning method for XML annotation of documents
IJCAI'05 Proceedings of the 19th international joint conference on Artificial intelligence

Automatic document structure detection for data integration
BIS'07 Proceedings of the 10th international conference on Business information systems

PIES: a web information extraction system using ontology and tag patterns
WAIM'05 Proceedings of the 6th international conference on Advances in Web-Age Information Management

From legacy documents to XML: a conversion framework
ECDL'05 Proceedings of the 9th European conference on Research and Advanced Technology for Digital Libraries

Extending ER models to capture database transformations to build data sets for data mining
Data & Knowledge Engineering

Abstract

Despite the advancement of XML, the majority of documents on the Web are still marked up with HTML for visual rendering purposes only, creating a large body of "legacy" data. To query Web-based data more efficiently and effectively than keyword-based retrieval alone allows, such Web documents must be enriched with both structure and semantics.

This paper describes a novel approach to the integration of topic-specific HTML documents into a repository of XML documents. In particular, we describe how topic-specific HTML documents are transformed into XML documents. The proposed document transformation and semantic element tagging process uses document restructuring rules and minimal information about the topic in the form of concepts. For the resulting XML documents, a majority schema is derived that describes the structures common to the documents in the form of a DTD.

We explore and discuss different techniques and rules for document conversion and majority schema discovery. Finally, we demonstrate the feasibility and effectiveness of our approach by applying it to a set of resume HTML documents gathered by a Web crawler.
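
The abstract does not spell out how the majority schema is computed, so the following is only a minimal sketch of one plausible reading: keep every parent/child element pair that occurs in more than half of the converted XML documents, then print the result as DTD-style element declarations. The function names, the 0.5 threshold, the unordered content model, and the sample resume element names (contact, education, skills, hobbies) are all assumptions made for illustration and are not taken from the paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def child_pairs(xml_text):
    """Return the set of (parent, child) tag pairs found in one document."""
    root = ET.fromstring(xml_text)
    pairs = set()
    for parent in root.iter():
        for child in parent:
            pairs.add((parent.tag, child.tag))
    return pairs


def majority_schema(docs, threshold=0.5):
    """Keep pairs present in more than `threshold` of the documents.

    The threshold is an assumed parameter, not a value from the paper.
    """
    counts = Counter()
    for doc in docs:
        # Each document contributes at most once per pair.
        counts.update(child_pairs(doc))
    needed = threshold * len(docs)
    schema = {}
    for (parent, child), n in counts.items():
        if n > needed:
            schema.setdefault(parent, set()).add(child)
    return schema


def to_dtd(schema):
    """Render the schema as DTD element declarations.

    Simplification: children are listed as an unordered, repeatable
    choice, and leaf elements are not declared.
    """
    lines = []
    for parent in sorted(schema):
        children = " | ".join(sorted(schema[parent]))
        lines.append(f"<!ELEMENT {parent} ({children})*>")
    return "\n".join(lines)


# Hypothetical converted resume documents.
docs = [
    "<resume><contact/><education/><skills/></resume>",
    "<resume><contact/><education/></resume>",
    "<resume><contact/><hobbies/></resume>",
]
schema = majority_schema(docs)
print(to_dtd(schema))
```

In this sample, contact appears in all three documents and education in two, so both survive the majority filter, while skills and hobbies each appear only once and are dropped. A fuller treatment would also need to recover element order and cardinality, which this sketch ignores.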