An Approach to Deep Web Crawling by Sampling

Authors:
Jianguo Lu; Yan Wang; Jie Liang; Jessica Chen; Jiming Liu
Venue:
WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Year:
2008

Citing 11

Query-based sampling of text databases. ACM Transactions on Information Systems (TOIS)
Crawling the Hidden Web. Proceedings of the 27th International Conference on Very Large Data Bases
RoadRunner: Towards Automatic Data Extraction from Large Web Sites. Proceedings of the 27th International Conference on Very Large Data Bases
What are Web services? Communications of the ACM - E-services: a cornucopia of digital offerings ushers in the next Net-based evolution
Automatic generation of agents for collecting hidden web pages for data extraction. Data & Knowledge Engineering - Special issue: WIDM 2002
Lucene in Action (In Action series)
DEQUE: querying the deep web. Data & Knowledge Engineering
Downloading textual hidden web content through keyword queries. Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Query Selection Techniques for Efficient Crawling of Structured Web Sources. ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering
Accessing the deep web. Communications of the ACM - ACM at sixty: a look back in time

Cited 12

Crawling Deep Web Using a New Set Covering Algorithm. ADMA '09 Proceedings of the 5th International Conference on Advanced Data Mining and Applications
Estimating deep web data source size by capture-recapture method. Information Retrieval
Optimizing content freshness of relations extracted from the web using keyword search. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Ranking bias in deep web size estimation using capture recapture method. Data & Knowledge Engineering
Federated Search. Foundations and Trends in Information Retrieval
A QIIIEP based domain specific hidden web crawler. Proceedings of the International Conference & Workshop on Emerging Trends in Technology
Incremental structured web database crawling via history versions. WISE'10 Proceedings of the 11th international conference on Web information systems engineering
Efficient deep web crawling using reinforcement learning. PAKDD'10 Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I
A Novel Architecture for Deep Web Crawler. International Journal of Information Technology and Web Engineering
A Generalized Links and Text Properties Based Forum Crawler. WI-IAT '12 Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
A brief history of web crawlers. CASCON '13 Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research
Selecting queries from sample to crawl deep web data sources. Web Intelligence and Agent Systems

Abstract

Crawling the deep web is the process of collecting data from search interfaces by issuing queries. With the wide availability of programmable interfaces encoded in web services, deep web crawling has found a wide variety of applications. One of the major challenges in crawling the deep web is selecting queries so that most of the data can be retrieved at a low cost. We propose a general method for this problem. To minimize the number of duplicates retrieved, we reduce the problem of selecting an optimal set of queries from a sample of the data source to the well-known set-covering problem and adopt a classical algorithm to solve it. To verify that the queries selected from a sample also produce a good result for the entire data source, we carried out a set of experiments on large corpora including Wikipedia and Reuters. We show empirically that our sampling-based method is effective: 1) the queries selected from samples can harvest most of the data in the original database; 2) queries with a low overlapping rate in the samples also have a low overlapping rate in the original database; and 3) neither the sample nor the pool of terms from which the queries are selected needs to be very large.
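
The reduction to set covering described in the abstract can be illustrated with a short sketch. The abstract does not name the classical algorithm it adopts, so the greedy heuristic below is an assumption, as are the function names, the `coverage` parameter, the toy `sample` data, and the definition of overlapping rate as total documents retrieved divided by unique documents retrieved. Each candidate term is treated as a query that matches a set of sampled document IDs.

```python
from typing import Dict, List, Set


def greedy_query_selection(term_to_docs: Dict[str, Set[int]],
                           coverage: float = 1.0) -> List[str]:
    """Greedy set cover (assumed heuristic): repeatedly pick the term that
    matches the most not-yet-covered sample documents."""
    universe: Set[int] = set().union(*term_to_docs.values())
    target = coverage * len(universe)
    covered: Set[int] = set()
    chosen: List[str] = []
    while len(covered) < target:
        # Iterating in sorted order makes ties resolve alphabetically.
        best = max(sorted(term_to_docs),
                   key=lambda t: len(term_to_docs[t] - covered))
        gain = term_to_docs[best] - covered
        if not gain:  # no term adds new documents; stop early
            break
        chosen.append(best)
        covered |= gain
    return chosen


def overlapping_rate(queries: List[str],
                     term_to_docs: Dict[str, Set[int]]) -> float:
    """Assumed definition: total matches over all queries divided by the
    number of distinct documents matched (1.0 means no duplicates)."""
    total = sum(len(term_to_docs[q]) for q in queries)
    unique = len(set().union(*(term_to_docs[q] for q in queries)))
    return total / unique if unique else 0.0


# Hypothetical sample: term -> IDs of sampled documents containing it.
sample = {
    "web": {1, 2, 3},
    "crawl": {3, 4},
    "deep": {1, 2},
    "query": {4, 5},
}
selected = greedy_query_selection(sample)
print(selected, overlapping_rate(selected, sample))
```

In this toy sample the greedy choice covers all five documents with two queries and no duplicates, whereas issuing all four terms would retrieve nine results for the same five documents. Whether a selection made on a sample carries over to the full data source is the empirical question the paper's experiments address.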