Sampling, information extraction and summarisation of hidden web databases

Authors:
Yih-Ling Hedley;Muhammad Younas;Anne James;Mark Sanderson
Affiliations:
School of Mathematical and Information Sciences, Coventry University, Coventry, United Kingdom;Department of Computing, Oxford Brookes University, Oxford, United Kingdom;School of Mathematical and Information Sciences, Coventry University, Coventry, United Kingdom;Department of Information Studies, University of Sheffield, Sheffield, United Kingdom
Venue:
Data & Knowledge Engineering - Special issue: WIDM 2004
Year:
2006

Citing 20
Cited 8

Query routing for Web search engines: architectures and experiments

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Query-based sampling of text databases

ACM Transactions on Information Systems (TOIS)
Automatic information extraction from web pages

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Template detection via data mining and its applications

Proceedings of the 11th international conference on World Wide Web
Information Retrieval

Information Retrieval
Introduction to Modern Information Retrieval

Introduction to Modern Information Retrieval
QProber: A system for automatic classification of hidden-Web databases

ACM Transactions on Information Systems (TOIS)
Crawling the Hidden Web

Proceedings of the 27th International Conference on Very Large Data Bases
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
On the Automatic Extraction of Data from the Hidden Web

Revised Papers from the HUMACS, DASWIS, ECOMO, and DAMA on ER 2001 Workshops
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Automatic Information Discovery from the "Invisible Web"

ITCC '02 Proceedings of the International Conference on Information Technology: Coding and Computing
Learning block importance models for web pages

Proceedings of the 13th international conference on World Wide Web
Automatic detection of fragments in dynamically generated web pages

Proceedings of the 13th international conference on World Wide Web
Automatic generation of agents for collecting hidden web pages for data extraction

Data & Knowledge Engineering - Special issue: WIDM 2002
Block-based web search

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
A two-phase sampling technique for information extraction from hidden web databases

Proceedings of the 6th annual ACM international workshop on Web information and data management
Automatic extraction of informative blocks from webpages

Proceedings of the 2005 ACM symposium on Applied computing
Automatic Fragment Detection in Dynamic Web Pages and Its Impact on Caching

IEEE Transactions on Knowledge and Data Engineering
A TNATS approach to hidden web documents

ICDCIT'04 Proceedings of the First international conference on Distributed Computing and Internet Technology

An unsupervised method for joint information extraction and feature mining across different Web sites

Data & Knowledge Engineering
Privacy preservation of aggregates in hidden databases: why and how?

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Turbo-charging hidden database samplers with overflowing queries and skew reduction

Proceedings of the 13th International Conference on Extending Database Technology
Unbiased estimation of size and other aggregates over hidden web databases

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
An analytical model for delivery evaluation of multimodal contents in pervasive computing

Computers in Industry
Just-in-time analytics on large file systems

FAST'11 Proceedings of the 9th USENIX conference on File and stroage technologies
Attribute domain discovery for hidden web databases

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Rank discovery from web databases

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

Hidden Web databases maintain a collection of specialised documents, which are dynamically generated using page templates. This paper presents the Two-Phase Sampling (2PS) technique that detects and extracts query-related information from documents contained in databases. 2PS is based on a two-phase framework for the sampling, information extraction and summarisation of Hidden Web documents. In the first phase, 2PS samples and stores documents for further analysis. In the second phase, it detects Web page templates from sampled documents and extracts relevant information from which a content summary is then generated. Experimental results demonstrate that 2PS effectively eliminates irrelevant information from sampled documents and generates terms and frequencies with improved accuracy.