The amount of information accessible on the web is huge. Although web users now expect a more integrated way to access that information, it remains difficult to integrate information from different web sites, because most web pages are authored in HTML, a presentation-oriented language that is usually considered unstructured. Many research efforts aim at extracting information from web pages. Existing approaches typically transform the extraction results into structured or semi-structured data formats so that other applications can process them further and discover more useful information. Nevertheless, the unstructured nature of HTML makes this transformation process complex, and such approaches can hardly be widely adopted. In this paper, an annotation-based HTML-to-XML transformation technology is proposed. The mechanism is developed with both usability and simplicity in mind. With the proposed mechanism, ordinary web site developers simply add annotations to their web pages. Annotated web pages can then be processed by our software libraries and transformed into XML documents, which are machine-understandable. Software agents can thus be developed on top of our technology.
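To make the idea of annotation-driven transformation concrete, the following is a minimal sketch and not the library described in the paper. The abstract does not specify the annotation syntax, so the sketch assumes a hypothetical `data-field` attribute whose value names the XML element to emit; the `page` root element and the names `AnnotationExtractor` and `html_to_xml` are likewise illustrative. It also assumes that every non-void HTML start tag has a matching end tag.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have an end tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class AnnotationExtractor(HTMLParser):
    """Builds an XML tree from elements carrying a (hypothetical) annotation attribute."""

    def __init__(self, attr="data-field", root_name="page"):
        super().__init__(convert_charrefs=True)
        self.attr = attr                  # assumed annotation attribute
        self.root = ET.Element(root_name)
        self.xml_stack = [self.root]      # currently open XML elements
        self.html_stack = []              # per open HTML tag: did it open an XML element?

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        name = dict(attrs).get(self.attr)
        if name:
            # An annotated element becomes a child of the nearest annotated ancestor.
            self.xml_stack.append(ET.SubElement(self.xml_stack[-1], name))
        self.html_stack.append(bool(name))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.html_stack:
            return
        if self.html_stack.pop():
            self.xml_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or len(self.xml_stack) == 1:
            return  # whitespace, or text outside any annotated element
        node = self.xml_stack[-1]
        if len(node) == 0:
            # Keep text only while the annotated element has no annotated children.
            node.text = ((node.text or "") + " " + text).strip()


def html_to_xml(html, attr="data-field", root_name="page"):
    """Return the XML serialization of the annotated parts of an HTML string."""
    parser = AnnotationExtractor(attr=attr, root_name=root_name)
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    page = """
    <ul>
      <li data-field="book">
        <span data-field="title">Data on the Web</span>
        <span class="price">Price: <b data-field="price">42.00</b></span>
      </li>
    </ul>
    """
    print(html_to_xml(page))
```

In this sketch the presentational markup (`ul`, `span`, the "Price:" label) is discarded and only the annotated elements survive, nested the way they were nested in the page. A real annotation scheme would also need to cover attributes, repeated records, and malformed HTML, none of which is handled here.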