Query-related data extraction of hidden web documents

  • Authors:
  • Y. L. Hedley; M. Younas; A. James; M. Sanderson

  • Affiliations:
  • Coventry University, Coventry, UK; Coventry University, Coventry, UK; Coventry University, Coventry, UK; University of Sheffield, Sheffield, UK

  • Venue:
  • Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2004

Abstract

The larger part of the information on the Web is stored in document databases and is not indexed by general-purpose search engines (e.g., Google and Yahoo). Such information is generated dynamically by querying these databases, which are referred to as Hidden Web databases. Documents returned in response to a user query are typically presented using template-generated Web pages. This paper proposes a novel approach that identifies Web page templates by analysing the textual contents and the adjacent tag structures of a document in order to extract query-related data. Preliminary results demonstrate that our approach effectively detects templates and retrieves data with high recall and precision.
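
The idea of separating template text from query-related data can be illustrated with a minimal sketch. This is not the algorithm from the paper, whose details the abstract does not give. It assumes that a text segment belongs to the template when the same text, preceded by the same tag, appears on every result page sampled from one database. The names `SegmentParser`, `segments`, `extract_query_data` and the parameter `min_share` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class SegmentParser(HTMLParser):
    """Collects (preceding_tag, text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.last_tag = None   # tag seen immediately before the current text
        self.segments = []

    def handle_starttag(self, tag, attrs):
        self.last_tag = "<" + tag + ">"

    def handle_endtag(self, tag):
        self.last_tag = "</" + tag + ">"

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text:
            self.segments.append((self.last_tag, text))


def segments(html):
    """Return the (preceding_tag, text) segments of one page, in order."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    return parser.segments


def extract_query_data(pages, min_share=1.0):
    """Assumption: a segment found on at least min_share of the pages is
    template text; everything else is treated as query-related data."""
    per_page = [segments(page) for page in pages]
    counts = Counter()
    for segs in per_page:
        counts.update(set(segs))  # count each segment once per page
    threshold = min_share * len(pages)
    template = {seg for seg, n in counts.items() if n >= threshold}
    return [[text for tag, text in segs if (tag, text) not in template]
            for segs in per_page]


if __name__ == "__main__":
    sample_pages = [
        "<html><body><h1>Results</h1><p>Heart disease study</p>"
        "<a>Next page</a></body></html>",
        "<html><body><h1>Results</h1><p>Diabetes trial report</p>"
        "<a>Next page</a></body></html>",
    ]
    for data in extract_query_data(sample_pages):
        print(data)
```

In this sketch the heading "Results" and the link text "Next page" recur with the same adjacent tag on both sample pages and are dropped, while the page-specific paragraphs are kept. Lowering `min_share` would tolerate template text that is missing from some pages, at the cost of discarding data that happens to repeat.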