Query routing for Web search engines: architectures and experiments
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Query-based sampling of text databases
ACM Transactions on Information Systems (TOIS)
Automatic information extraction from web pages
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to Modern Information Retrieval
Introduction to Modern Information Retrieval
QProber: A system for automatic classification of hidden-Web databases
ACM Transactions on Information Systems (TOIS)
Proceedings of the 27th International Conference on Very Large Data Bases
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
On the Automatic Extraction of Data from the Hidden Web
Revised Papers from the HUMACS, DASWIS, ECOMO, and DAMA on ER 2001 Workshops
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Automatic Information Discovery from the "Invisible Web"
ITCC '02 Proceedings of the International Conference on Information Technology: Coding and Computing
Automatic generation of agents for collecting hidden web pages for data extraction
Data & Knowledge Engineering - Special issue: WIDM 2002
Query-related data extraction of hidden web documents
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Sampling, information extraction and summarisation of hidden web databases
Data & Knowledge Engineering - Special issue: WIDM 2004
Federated text retrieval from uncooperative overlapped collections
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
CCReSD: concept-based categorisation of Hidden Web databases
International Journal of High Performance Computing and Networking
Robust result merging using sample-based score estimates
ACM Transactions on Information Systems (TOIS)
Privacy preservation of aggregates in hidden databases: why and how?
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Turbo-charging hidden database samplers with overflowing queries and skew reduction
Proceedings of the 13th International Conference on Extending Database Technology
Unbiased estimation of size and other aggregates over hidden web databases
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Just-in-time analytics on large file systems
FAST'11 Proceedings of the 9th USENIX conference on File and stroage technologies
Foundations and Trends in Information Retrieval
A TNATS approach to hidden web documents
ICDCIT'04 Proceedings of the First international conference on Distributed Computing and Internet Technology
Hi-index | 0.00 |
Hidden Web databases maintain a collection of specialised documents, which are dynamically generated in response to users' queries. However, the documents are generated by Web page templates, which contain information that is irrelevant to queries. This paper presents a Two-Phase Sampling (2PS) technique that detects templates and extracts query-related information from the sampled documents of a database. In the first phase, 2PS queries databases with terms contained in their search interface pages and the subsequently sampled documents. This process retrieves a required number of documents. In the second phase, 2PS detects Web page templates in the sampled documents in order to extract information relevant to queries. We test 2PS on a number of real-world Hidden Web databases. Experimental results demonstrate that 2PS effectively eliminates irrelevant information contained in Web page templates and generates terms and frequencies with improved accuracy.