Query-related data extraction of hidden web documents

Authors:
Y. L. Hedley;M. Younas;A. James;M. Sanderson
Affiliations:
Coventry University, Coventry, UK;Coventry University, Coventry, UK;Coventry University, Coventry, UK;University of Sheffield, Sheffield, UK
Venue:
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2004

Abstract

Most of the information on the Web is stored in document databases and is not indexed by general-purpose search engines (e.g., Google and Yahoo). Such information is generated dynamically by querying these databases, which are referred to as Hidden Web databases. Documents returned in response to a user query are typically presented using template-generated Web pages. This paper proposes a novel approach that identifies Web page templates by analysing the textual contents and the adjacent tag structures of a document in order to extract query-related data. Preliminary results demonstrate that our approach effectively detects templates and retrieves data with high recall and precision.
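
The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that several result pages from the same database are available, and that a text segment which recurs on every page between the same pair of adjacent tags belongs to the template; whatever remains is treated as query-related data. The names `SegmentParser`, `segments_of`, `extract_query_related` and the `min_share` parameter are hypothetical and introduced here for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class SegmentParser(HTMLParser):
    """Collect (preceding_tag, text, following_tag) triples from one page."""

    def __init__(self):
        super().__init__()
        self.segments = []     # finished triples
        self._last_tag = None  # most recent tag seen before the current text
        self._pending = None   # (preceding_tag, text) waiting for its following tag

    def _saw_tag(self, name):
        # A tag closes any pending text segment and becomes the new "last tag".
        if self._pending is not None:
            prev, text = self._pending
            self.segments.append((prev, text, name))
            self._pending = None
        self._last_tag = name

    def handle_starttag(self, tag, attrs):
        self._saw_tag("<" + tag + ">")

    def handle_endtag(self, tag):
        self._saw_tag("</" + tag + ">")

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if not text:
            return
        if self._pending is None:
            self._pending = (self._last_tag, text)
        else:
            # Text split by a comment or similar: merge into one segment.
            self._pending = (self._pending[0], self._pending[1] + " " + text)

    def close(self):
        super().close()
        self._saw_tag(None)  # flush trailing text that has no following tag


def segments_of(html):
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    return parser.segments


def extract_query_related(pages, min_share=1.0):
    """Return, for each page, the text segments not judged to be template.

    Assumption: a (tag, text, tag) triple found on at least `min_share`
    of the pages is template content.
    """
    per_page = [segments_of(page) for page in pages]
    counts = Counter()
    for segs in per_page:
        counts.update(set(segs))  # count each triple once per page
    threshold = min_share * len(pages)
    template = {seg for seg, n in counts.items() if n >= threshold}
    return [[seg[1] for seg in segs if seg not in template] for segs in per_page]
```

Under these assumptions, at least two pages are needed: with a single page every segment trivially meets the threshold and is classed as template. Lowering `min_share` tolerates template text that is missing from some pages, at the cost of misclassifying data that happens to repeat across results.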