Automatic Wrapper Generation for Semi-Structured Biological Data Based on Table Structure Identification

  • Authors:
  • Liangyou Chen; Hasan M. Jamil; Nan Wang

  • Venue:
  • DEXA '03 Proceedings of the 14th International Workshop on Database and Expert Systems Applications
  • Year:
  • 2003

Abstract

Biological data analyses usually require complex manipulations involving tool applications, multiple web site navigation, result selection and filtering, and iteration over the internet. Most biological data are generated from structured databases and by applications and presented to the users embedded within repeated structures, or tables, in HTML documents. In this paper we outline a novel technique for the identification of table structures in HTML documents. This identification technique is then used to automatically generate composite wrappers for applications requiring distributed resources. We demonstrate that our method is robust enough to discover standard as well as non-standard table structures in HTML documents. Thus our technique outperforms contemporary techniques used in systems such as XWrap and AutoWrapper. We discuss our technique in the context of our PickUp system that exploits the theoretical developments presented in this paper and emerges as an elegant automatic wrapper generation system.