Crawling Deep Web Using a New Set Covering Algorithm

Authors:
Yan Wang; Jianguo Lu; Jessica Chen
Affiliations:
School of Computer Science, University of Windsor, Windsor, Canada N9B 3P4; School of Computer Science, University of Windsor, Windsor, Canada N9B 3P4 and Key Lab of Novel Software Technology, Nanjing, China; School of Computer Science, University of Windsor, Windsor, Canada N9B 3P4
Venue:
ADMA '09 Proceedings of the 5th International Conference on Advanced Data Mining and Applications
Year:
2009

Abstract

Crawling the deep web often requires selecting an appropriate set of queries that together cover most of the documents in the data source at low cost. This can be modeled as a set covering problem, which has been extensively studied. Conventional set covering algorithms, however, do not work well when applied to deep web crawling because of special features of this application domain. Most set covering algorithms assume that the elements being covered are uniformly distributed, whereas in deep web crawling neither the sizes of documents nor the document frequencies of the queries are distributed uniformly; instead, they follow a power law distribution. We have therefore developed a new set covering algorithm that targets deep web crawling. Compared with our previous deep web crawling method, which uses a straightforward greedy set covering algorithm, it introduces weights into the greedy strategy. Our experiment, carried out on a variety of corpora, shows that the new method consistently outperforms its unweighted version.
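
The abstract does not state how the weights are defined, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes that each query's cost is its document frequency (the number of documents it returns) and, for the weighted variant, that a document matched by fewer queries receives a larger weight (here 1 divided by the number of queries matching it). The names `greedy_cover`, `total_cost`, and the toy query-document matrix are hypothetical.

```python
from typing import Dict, List, Set


def greedy_cover(matrix: Dict[str, Set[int]], weighted: bool = False) -> List[str]:
    """Greedily pick queries until every document in the matrix is covered.

    matrix maps a query term to the set of document ids it returns.
    Assumption: the cost of a query is its document frequency, len(docs).
    """
    universe: Set[int] = set().union(*matrix.values()) if matrix else set()

    if weighted:
        # Assumed weighting: documents reachable by few queries count for more.
        degree = {d: 0 for d in universe}
        for docs in matrix.values():
            for d in docs:
                degree[d] += 1
        weight = {d: 1.0 / degree[d] for d in universe}
    else:
        # Unweighted greedy: every newly covered document counts as 1.
        weight = {d: 1.0 for d in universe}

    covered: Set[int] = set()
    chosen: List[str] = []
    while covered != universe:
        best, best_score = None, 0.0
        for q in sorted(matrix):  # sorted for deterministic tie-breaking
            new_docs = matrix[q] - covered
            if not new_docs:
                continue
            # Benefit of the new documents per unit of cost.
            score = sum(weight[d] for d in new_docs) / len(matrix[q])
            if score > best_score:
                best, best_score = q, score
        chosen.append(best)
        covered |= matrix[best]
    return chosen


def total_cost(matrix: Dict[str, Set[int]], chosen: List[str]) -> int:
    """Total number of documents returned, counting duplicates across queries."""
    return sum(len(matrix[q]) for q in chosen)


if __name__ == "__main__":
    toy = {"a": {1, 2, 3, 4}, "b": {1, 2}, "c": {3, 4}, "d": {5}, "e": {4, 5}}
    for flag in (False, True):
        picked = greedy_cover(toy, weighted=flag)
        print(flag, picked, total_cost(toy, picked))
```

Dividing by the document frequency is what penalizes queries that return many already-seen documents; under a power law distribution a few very frequent terms would otherwise dominate the selection. On a matrix this small the two variants may well reach the same cost, so the toy run only shows the mechanics, not the improvement reported in the abstract.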