Developer-friendly annotation-based HTML-to-XML transformation technology

Authors:
Lendle Chun-Hsiung Tseng
Affiliations:
Lunghwa University of Science and Technology, Taoyuan, Taiwan ROC
Venue:
Proceedings of the 11th ACM symposium on Document engineering
Year:
2011

Abstract

The amount of information accessible on the web is huge. Although web users today expect a more integrated way to access it, it is still rather difficult to integrate information from different web sites, since most web pages are authored in HTML, a presentation-oriented language that is usually considered unstructured. Many research works aim at extracting information from web pages. Existing works typically transform the extraction results into structured or semi-structured data formats, so that other applications can further process the results to discover more useful information. Nevertheless, the unstructured nature of HTML makes the transformation process complex, and such approaches can hardly be widely adopted. In this paper, an annotation-based HTML-to-XML transformation technology is proposed. The mechanism is developed with both usability and simplicity in mind. With the proposed mechanism, ordinary web site developers simply add annotations to their web pages. Annotated web pages can then be processed by our software libraries and transformed into XML documents, which are machine-understandable. Software agents can thus be developed based on our technology.
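
The abstract does not spell out the annotation syntax or the library interface, so the following is only an illustrative sketch of the general idea, not the paper's implementation. It assumes a hypothetical `data-xml` attribute as the annotation, a hypothetical `html_to_xml` function, and well-nested HTML; only text placed directly inside an annotated element is carried over.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical annotation attribute; the paper's actual syntax is not given in the abstract.
ANNOTATION_ATTR = "data-xml"

# HTML void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class AnnotationToXml(HTMLParser):
    """Builds an XML tree from the annotated elements of an HTML page."""

    def __init__(self, root_name="document"):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element(root_name)
        # One entry per open HTML element: the XML element it created,
        # or None when the HTML element carries no annotation.
        self._stack = []

    def _nearest_annotated(self):
        # The closest enclosing annotated element, or the XML root.
        for element in reversed(self._stack):
            if element is not None:
                return element
        return self.root

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get(ANNOTATION_ATTR)
        created = ET.SubElement(self._nearest_annotated(), name) if name else None
        if tag not in VOID_TAGS:
            self._stack.append(created)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Simplification: keep only text directly inside an annotated element.
        if text and self._stack and self._stack[-1] is not None:
            element = self._stack[-1]
            element.text = ((element.text or "") + " " + text).strip()


def html_to_xml(html, root_name="document"):
    """Return an XML string containing only the annotated parts of the page."""
    parser = AnnotationToXml(root_name)
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


page = """
<div class="item" data-xml="book">
  <h2 data-xml="title">Document Engineering</h2>
  <p>by <span data-xml="author">A. Writer</span><br></p>
</div>
"""
print(html_to_xml(page))
```

In this sketch the presentation markup (`class`, `<p>`, `<br>`, the word "by") is dropped, and the nesting of the annotated elements determines the nesting of the resulting XML. A real implementation would also need to handle malformed HTML and mixed content, which are ignored here.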