PODS '97 Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
DIS '96 Proceedings of the fourth international conference on Parallel and distributed information systems
WebOQL: Restructuring Documents, Databases, and Webs
ICDE '98 Proceedings of the Fourteenth International Conference on Data Engineering
WebView: A Tool for Retrieving Internal Structures and Extracting Information from HTML Documents
DASFAA '99 Proceedings of the Sixth International Conference on Database Systems for Advanced Applications
Looking at the Web through XML Glasses
COOPIS '99 Proceedings of the Fourth IFCIS International Conference on Cooperative Information Systems
Although XML is rapidly emerging, numerous HTML documents still exist on the Web. In this paper, we present a heuristic approach for converting HTML documents to XML documents. During the conversion, we exclude all HTML elements from the resulting XML document, since these elements are designed exclusively for displaying data, but we retain the character data of each element along with the implicit hierarchy among the data. The proposed approach extracts the data hierarchy of an HTML document as closely as possible without human intervention. It can be used to construct the data hierarchy of an HTML document and to collect the data in HTML documents into an XML repository.
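To make the idea concrete, the following is a minimal sketch of one possible conversion, not the heuristics of the paper itself. It assumes, purely for illustration, that the implicit hierarchy can be read off the heading levels (h1 to h6); the names HierarchyExtractor and html_to_xml and the output element names document, section, and text are hypothetical. All HTML tags are discarded and only the character data is kept, nested under the nearest enclosing heading.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption for this sketch only: heading tags are the sole hierarchy signal.
HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class HierarchyExtractor(HTMLParser):
    """Drops HTML markup, keeps character data, nests it by heading level."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        # Stack of (heading_level, xml_element); the root sits at level 0.
        self.stack = [(0, self.root)]
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        level = HEADING_LEVELS.get(tag)
        if level is None:
            return  # any other tag is treated as presentation-only and dropped
        # Close sections at the same or a deeper level before opening a new one.
        while self.stack[-1][0] >= level:
            self.stack.pop()
        section = ET.SubElement(self.stack[-1][1], "section")
        self.stack.append((level, section))
        self.in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_LEVELS:
            self.in_heading = False

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if not text:
            return
        current = self.stack[-1][1]
        if self.in_heading:
            # Heading text becomes the title of the section it opens.
            title = (current.get("title", "") + " " + text).strip()
            current.set("title", title)
        else:
            ET.SubElement(current, "text").text = text


def html_to_xml(html):
    """Return an XML string holding only the data and its inferred hierarchy."""
    parser = HierarchyExtractor()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    sample = "<h1>Books</h1><p>Intro</p><h2>Fiction</h2><p><b>Dune</b></p>"
    print(html_to_xml(sample))
```

In this sketch the text "Dune" ends up inside the "Fiction" section, which is itself nested inside "Books", while the p and b tags disappear. A realistic converter would need further heuristics for tables, lists, and documents that do not use headings consistently.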