Automatic composite wrapper generation for semi-structured biological data based on table structure identification

Authors:
Liangyou Chen;Hasan M. Jamil;Nan Wang
Affiliations:
Mississippi State University;Wayne State University;Mississippi State University
Venue:
ACM SIGMOD Record
Year:
2004

Abstract

Biological data analyses usually require complex manipulations involving tool applications, navigation across multiple web sites, result selection and filtering, and iteration over the internet. Most biological data are generated from structured databases or by applications, and are presented to users embedded within repeated structures, or tables, in HTML documents. In this paper we outline a novel technique for identifying table structures in HTML documents. This identification technique is then used to automatically generate composite wrappers for applications that require distributed resources. We demonstrate that our method is robust enough to discover both standard and non-standard table structures in HTML documents, and thus outperforms contemporary techniques used in systems such as XWrap and AutoWrapper. We discuss our technique in the context of our PickUp system, which exploits the theoretical developments presented in this paper and emerges as an elegant automatic wrapper generation system.
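
The abstract does not describe the identification technique itself, so the following is only a minimal sketch of the general idea of spotting "repeated structures" in HTML, not the PickUp algorithm. It assumes a simple heuristic chosen purely for illustration: an element is a candidate table structure when at least `min_repeats` of its children share the same tag and the same sequence of child tags. The names `TreeBuilder`, `signature`, and `find_repeated_structures` are hypothetical and do not come from the paper.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have content, so the parser must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A bare element node: tag name, parent link, and child elements."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a lightweight element tree (text and attributes are ignored)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent


def signature(node):
    """Structural fingerprint: the tag plus the ordered tags of its children."""
    return (node.tag, tuple(child.tag for child in node.children))


def find_repeated_structures(html, min_repeats=2):
    """Return (container_tag, record_tag, count) for each candidate structure.

    Illustrative heuristic only: children without child elements are skipped,
    so flat lists of text-only items are not reported.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    found = []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        counts = Counter(signature(c) for c in node.children if c.children)
        if counts:
            sig, n = counts.most_common(1)[0]
            if n >= min_repeats:
                found.append((node.tag, sig[0], n))
        stack.extend(node.children)
    return found
```

Under this heuristic, a standard `<table>` with uniform rows and a non-standard layout built from repeated `<div>` records are both reported, since the check looks only at structural repetition and not at the tag names. A real wrapper generator would need considerably more than this, for example tolerance for rows that differ slightly and a way to map the discovered records to fields.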