Selecting queries from sample to crawl deep web data sources

Authors:
Yan Wang;Jianguo Lu;Jie Liang;Jessica Chen;Jiming Liu
Affiliations:
School of Computer Science, University of Windsor, Windsor, Ontario, Canada, E-mail: {jlu,wang16c,liangr,xjchen}@uwindsor.ca;School of Computer Science, University of Windsor, Windsor, Ontario, Canada, E-mail: {jlu,wang16c,liangr,xjchen}@uwindsor.ca and State Key Lab for Novel Software Technology, Nanjing University, Na ...;School of Computer Science, University of Windsor, Windsor, Ontario, Canada, E-mail: {jlu,wang16c,liangr,xjchen}@uwindsor.ca;School of Computer Science, University of Windsor, Windsor, Ontario, Canada, E-mail: {jlu,wang16c,liangr,xjchen}@uwindsor.ca;Department of Computer Science, Hong Kong Baptist University, Hong Kong, China, E-mail: jiming@comp.hkbu.edu.hk
Venue:
Web Intelligence and Agent Systems
Year:
2012

Abstract

This paper studies the problem of selecting queries to efficiently crawl a deep web data source using a set of sample documents. Crawling the deep web is the process of collecting data from search interfaces by issuing queries. One of the major challenges in crawling the deep web is selecting queries so that most of the data can be retrieved at a low cost. We propose to learn a set of queries from a sample of the data source. To verify that queries selected from a sample also produce good results for the entire data source, we carried out a set of experiments on large corpora including Gov2, newsgroups, Wikipedia and Reuters. We show empirically that our sampling-based method is effective: (1) the queries selected from samples can harvest most of the data in the original database; (2) queries with a low overlapping rate in the samples also result in a low overlapping rate in the original database; and (3) neither the sample nor the pool of terms from which the queries are selected needs to be very large. Compared with other query selection methods, our method obtains the queries by analyzing a small set of sample documents, instead of learning the next best query incrementally from all the documents matched by previous queries.
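
The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the general idea, written under stated assumptions: queries are single terms, the sample is small enough to index in memory, and terms are picked with a greedy set-cover-style heuristic that prefers terms returning many new documents per document retrieved. The function names (`build_index`, `select_queries`) and the definition of overlapping rate used here (total documents returned divided by distinct documents retrieved) are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List, Set, Tuple


def build_index(sample_docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each term to the ids of the sample documents that contain it."""
    index: Dict[str, Set[int]] = {}
    for doc_id, text in sample_docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def select_queries(
    index: Dict[str, Set[int]], target_coverage: float = 1.0
) -> Tuple[List[str], float]:
    """Greedily pick terms until the target share of the sample is covered.

    Assumption: the cost of a query is the number of documents it returns,
    so each step picks the term with the highest ratio of new documents to
    returned documents, breaking ties in favour of more new documents.
    """
    all_docs: Set[int] = set().union(*index.values())
    covered: Set[int] = set()
    queries: List[str] = []
    returned = 0  # total documents returned, duplicates included

    while all_docs and len(covered) / len(all_docs) < target_coverage:
        best_term = None
        best_score = (0.0, 0)
        for term in sorted(index):  # sorted for deterministic tie-breaking
            new_docs = len(index[term] - covered)
            if new_docs == 0:
                continue
            score = (new_docs / len(index[term]), new_docs)
            if score > best_score:
                best_term, best_score = term, score
        if best_term is None:
            break  # no remaining term adds a new document
        queries.append(best_term)
        covered |= index[best_term]
        returned += len(index[best_term])

    # Assumed definition: 1.0 means no document was retrieved twice.
    overlapping_rate = returned / len(covered) if covered else 0.0
    return queries, overlapping_rate


if __name__ == "__main__":
    sample = {
        1: "deep web crawl",
        2: "deep web query",
        3: "query sample",
        4: "sample size",
    }
    print(select_queries(build_index(sample)))
```

In a crawler built along these lines, the queries chosen on the sample would then be issued against the real search interface; the abstract's claim is that coverage and overlapping rate measured on the sample carry over to the full data source.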