A novel hybrid focused crawling algorithm to build domain-specific collections

Authors:
Edward A. Fox;Yuxin Chen
Affiliations:
Virginia Polytechnic Institute and State University;Virginia Polytechnic Institute and State University
Venue:
A novel hybrid focused crawling algorithm to build domain-specific collections
Year:
2007

Citing 0
Cited 4

Metadata domain-knowledge driven search engine in "HyperManyMedia" E-learning resources
CSTST '08 Proceedings of the 5th international conference on Soft computing as transdisciplinary science and technology

Improving the performance of focused web crawlers
Data & Knowledge Engineering

Metadata as seeds for building an ontology driven information retrieval system
International Journal of Hybrid Intelligent Systems

A constrained crawling approach and its application to a specialised search engine
International Journal of Information and Communication Technology

Abstract

The Web contains a large amount of useful information and resources and is expanding rapidly. Collecting domain-specific documents and information from the Web is one of the most important methods for building digital libraries for the scientific community. Focused crawlers can selectively retrieve Web documents relevant to a specific domain to build collections for domain-specific search engines or digital libraries. Traditional focused crawlers, which normally adopt the simple Vector Space Model and local Web search algorithms, typically find relevant Web pages only with low precision. Recall is also often low, since they explore a limited sub-graph of the Web that surrounds the starting URL set and ignore relevant pages outside this sub-graph. In this work, we investigated how to apply an inductive machine learning algorithm and a meta-search technique to the traditional focused crawling process in order to overcome these problems and improve performance. We proposed a novel hybrid focused crawling framework based on Genetic Programming (GP) and meta-search. We showed that this framework can be applied to traditional focused crawlers to find more relevant Web documents accurately for use in digital libraries and domain-specific search engines. The framework is validated through experiments performed on test documents from the Open Directory Project [22]. Our studies show that introducing genetic programming and meta-search methods into the focused crawling process yields an improvement over the traditional focused crawler.
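
To make the recall problem described in the abstract concrete, the following is a minimal sketch of a traditional focused crawler, not the authors' framework. It assumes a toy in-memory Web (the hypothetical `TOY_PAGES` and `TOY_LINKS` dictionaries stand in for real fetching), scores pages with plain Vector Space Model cosine similarity, and expands links only from pages judged relevant. The `extra_seeds` parameter is an illustrative stand-in for URLs that a meta-search step might contribute; no search engine is queried, and the GP-learned similarity function is not implemented here.

```python
import heapq
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def focused_crawl(pages, links, seeds, topic, threshold=0.2, extra_seeds=()):
    """Best-first crawl over a toy in-memory Web (illustrative only).

    pages: dict url -> page text (hypothetical stand-in for fetching)
    links: dict url -> list of outgoing urls
    extra_seeds: urls a meta-search step might supply (assumption)
    """
    topic_vec = Counter(topic.lower().split())
    # Merge both seed sources, dropping duplicates but keeping order.
    start = list(dict.fromkeys([*seeds, *extra_seeds]))
    # heapq is a min-heap, so scores are negated; seeds get top priority.
    frontier = [(-1.0, url) for url in start]
    heapq.heapify(frontier)
    seen = set(start)
    collected = []
    while frontier:
        _, url = heapq.heappop(frontier)
        page_vec = Counter(pages.get(url, "").lower().split())
        score = cosine(page_vec, topic_vec)
        if score < threshold:
            continue  # irrelevant page: its outgoing links are not followed
        collected.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                # A child inherits its parent's relevance as crawl priority.
                heapq.heappush(frontier, (-score, nxt))
    return collected


# Toy data: the relevant page "c" is reachable only through irrelevant "b".
TOY_PAGES = {
    "a": "digital library crawler",
    "b": "cooking recipes pasta",
    "c": "focused crawler digital library collections",
}
TOY_LINKS = {"a": ["b"], "b": ["c"], "c": []}
TOPIC = "focused crawler digital library"

local_only = focused_crawl(TOY_PAGES, TOY_LINKS, ["a"], TOPIC)
with_extra = focused_crawl(TOY_PAGES, TOY_LINKS, ["a"], TOPIC, extra_seeds=["c"])
```

In this toy graph the local crawl stops at the irrelevant page `b` and never reaches `c`, while supplying `c` as an additional seed lets the crawler collect it. This mirrors, in a simplified way, the motivation the abstract gives for adding meta-search to a crawler that otherwise explores only the sub-graph around its starting URLs.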