An automated approach for retrieving hierarchical data from HTML tables

Authors:
Seung-Jin Lim; Yiu-Kai Ng
Affiliations:
Computer Science Department, Brigham Young University, Provo, Utah (both authors)
Venue:
Proceedings of the eighth international conference on Information and knowledge management
Year:
1999

Abstract

Among HTML elements, HTML tables [RHJ98] encapsulate hierarchically structured data (hierarchical data for short) in a tabular structure. HTML tables do not come with a rigid schema, and almost any form of two-dimensional table is acceptable under the HTML grammar. This flexibility complicates the process of retrieving hierarchical data from HTML tables. In this paper, we propose an automated approach for retrieving hierarchical data from HTML tables. The proposed approach constructs the content tree of an HTML table, which captures the intended hierarchy of the data content of the table, without requiring the internal structure of the table to be known beforehand. In addition, the user of the content tree does not have to deal with HTML tags while retrieving the desired data from it. Our approach can be employed by (i) a query language written for retrieving hierarchically structured data, extracted from either the contents of HTML tables or other sources, (ii) a processor for converting HTML tables to XML documents, and (iii) a data warehousing repository for collecting hierarchical data from HTML tables and storing materialized views of the tables. The time complexity of the proposed retrieval approach is proportional to the number of HTML elements in an HTML table.
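To make the idea of a content tree concrete, the following is a minimal sketch in Python. It is not the algorithm proposed in the paper, which the abstract does not spell out. It assumes a much simpler case: the first row holds column headers, the first cell of each later row is a row label, and there are no spanning cells or nested tables. The names `TableGrid` and `content_tree` are hypothetical. The sketch does share two properties with the approach described above: it makes a single pass over the table's elements, and the caller retrieves data by label rather than by HTML tag.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Collect the text of every <th>/<td> cell, row by row, in one pass."""

    def __init__(self):
        super().__init__()
        self.rows = []      # list of rows; each row is a list of cell strings
        self._cell = None   # text fragments of the open cell, or None

    def _flush(self):
        # Close the open cell, if any, and store its whitespace-normalized text.
        if self._cell is not None:
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag in ("tr", "td", "th"):
            self._flush()   # HTML allows </td>, </th>, and </tr> to be omitted
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:           # tolerate a cell with no enclosing <tr>
                self.rows.append([])
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._flush()

    def close(self):
        super().close()
        self._flush()


def content_tree(html):
    """Return {row label: {column header: value}} for a simple table.

    Assumption (not from the paper): row 0 is the header row and column 0
    holds the row labels.
    """
    parser = TableGrid()
    parser.feed(html)
    parser.close()
    rows = [row for row in parser.rows if row]
    if not rows:
        return {}
    header, body = rows[0], rows[1:]
    return {row[0]: dict(zip(header[1:], row[1:])) for row in body}


html = """
<table>
  <tr><th>Model</th><th>Year</th><th>Price</th></tr>
  <tr><td>Civic</td><td>1998</td><td>9500</td></tr>
  <tr><td>Accord</td><td>1999</td><td>14200</td></tr>
</table>
"""
tree = content_tree(html)
print(tree["Accord"]["Price"])
```

A lookup such as `tree["Accord"]["Price"]` walks the tree by data labels only. Handling tables whose structure is not known in advance, which is the actual contribution described in the abstract, would require considerably more logic than this fixed layout.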