This paper studies the problem of selecting queries to efficiently crawl a deep web data source using a set of sample documents. Crawling the deep web is the process of collecting data from search interfaces by issuing queries. A major challenge is selecting the queries so that most of the data can be retrieved at low cost. We propose to learn a set of queries from a sample of the data source. To verify that queries selected from a sample also perform well on the entire data source, we carried out experiments on large corpora including Gov2, newsgroups, Wikipedia, and Reuters. We show empirically that our sampling-based method is effective: (1) the queries selected from samples can harvest most of the data in the original database; (2) queries with a low overlapping rate in the samples also yield a low overlapping rate in the original database; and (3) neither the sample nor the pool of candidate terms needs to be very large. Compared with other query selection methods, our method obtains the queries by analyzing a small set of sample documents, rather than learning the next best query incrementally from all the documents matched by previous queries.
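The core idea — picking a small set of queries from a sample so that they cover most documents with little overlap — can be sketched as a greedy selection over an inverted index of the sample. This is a minimal illustration, not the authors' exact algorithm: the function name `select_queries`, the term-set document representation, and the coverage threshold are all assumptions made for the example.

```python
# Hypothetical sketch of sampling-based query selection: greedily pick the
# term that retrieves the most *uncovered* sample documents, which keeps the
# overlapping rate of the selected queries low, and stop once a target
# fraction of the sample is covered.

def select_queries(sample_docs, coverage_target=0.9):
    """sample_docs: list of term sets, one per sampled document."""
    # Invert the sample: term -> ids of sample documents containing it.
    postings = {}
    for doc_id, terms in enumerate(sample_docs):
        for term in terms:
            postings.setdefault(term, set()).add(doc_id)

    covered, queries = set(), []
    target = coverage_target * len(sample_docs)
    while len(covered) < target and postings:
        # The term whose result set overlaps least with what is already harvested.
        best = max(postings, key=lambda t: len(postings[t] - covered))
        gain = postings.pop(best) - covered
        if not gain:            # no remaining term retrieves anything new
            break
        covered |= gain
        queries.append(best)
    return queries
```

The greedy choice mirrors the classical set-cover heuristic: each issued query is the one expected to return the largest number of not-yet-retrieved documents, estimated from the sample rather than from the full data source.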