The genre-aware approach to focused crawling targets pages on specific topics that can be expressed in terms of both genre and content information. The approach requires an expert to specify a set of terms describing the genre and the content of the pages of interest. In this paper, we analyze the impact of term selection on this approach. To that end, we performed an experimental study in which we varied the number of genre and content terms used in focused crawling processes targeting two kinds of pages: syllabi (genre) of computer science courses (subject) and sale offers (genre) of computer equipment (subject). The study showed that a small set of terms selected by an expert is usually enough to produce good results. In addition, we propose and experimentally evaluate a strategy for semi-automatic generation of the terms used in this approach. The results showed that the strategy is very effective and provides a means to assist an expert in specifying the required sets of terms.
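
To make the idea of describing a topic by separate genre and content term sets more concrete, the following is a minimal sketch of how a crawler might score a page against both sets. It is an illustration under stated assumptions, not the method evaluated in the paper: the abstract does not specify the similarity measure or how the two scores are combined, so the cosine similarity, the weighted sum, the `genre_weight` parameter, and all function names and example terms here are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_to_terms(page_tokens, terms):
    """Cosine similarity between the page's term-frequency vector
    and a binary vector over the given term set (assumed measure)."""
    counts = Counter(page_tokens)
    term_set = set(terms)
    dot = sum(counts[t] for t in term_set)
    page_norm = math.sqrt(sum(c * c for c in counts.values()))
    terms_norm = math.sqrt(len(term_set))
    if page_norm == 0 or terms_norm == 0:
        return 0.0
    return dot / (page_norm * terms_norm)


def relevance(page_text, genre_terms, content_terms, genre_weight=0.5):
    """Combine genre and content similarity with a weighted sum.
    The combination rule and the default weight are assumptions."""
    tokens = tokenize(page_text)
    genre_score = cosine_to_terms(tokens, genre_terms)
    content_score = cosine_to_terms(tokens, content_terms)
    return genre_weight * genre_score + (1 - genre_weight) * content_score


# Hypothetical term sets for "syllabi of computer science courses".
GENRE_TERMS = ["syllabus", "prerequisites", "grading"]
CONTENT_TERMS = ["algorithms", "databases", "programming"]
```

In a sketch like this, a crawler would follow links from pages whose `relevance` score exceeds some threshold. Varying the size of `GENRE_TERMS` and `CONTENT_TERMS` corresponds to the kind of term-selection experiment the abstract describes.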