Automatic composite wrapper generation for semi-structured biological data based on table structure identification

  • Authors: Liangyou Chen; Hasan M. Jamil; Nan Wang

  • Affiliations: Mississippi State University; Wayne State University; Mississippi State University

  • Venue: ACM SIGMOD Record
  • Year: 2004

Abstract

Biological data analyses usually require complex manipulations: applying tools, navigating multiple web sites, selecting and filtering results, and iterating over the Internet. Most biological data are generated from structured databases and by applications, and are presented to users embedded within repeated structures, or tables, in HTML documents. In this paper we outline a novel technique for identifying table structures in HTML documents. This identification technique is then used to automatically generate composite wrappers for applications requiring distributed resources. We demonstrate that our method is robust enough to discover standard as well as non-standard table structures in HTML documents. Thus, our technique outperforms contemporary techniques used in systems such as XWrap and AutoWrapper. We discuss our technique in the context of our PickUp system, which exploits the theoretical developments presented in this paper and emerges as an elegant automatic wrapper generation system.