Exploiting genre in focused crawling

Authors:
Guilherme T. De Assis;Alberto H. F. Laender;Marcos André Gonçalves;Altigran S. Da Silva
Affiliations:
Computer Science Department, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil;Computer Science Department, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil;Computer Science Department, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil;Computer Science Department, Federal University of Amazonas, Manaus, AM, Brazil
Venue:
SPIRE'07 Proceedings of the 14th international conference on String processing and information retrieval
Year:
2007

Citing (14 works referenced by this paper):
Information retrieval in the World-Wide Web: making client-based searching feasible. Selected papers of the first conference on World-Wide Web
The shark-search algorithm. An application: tailored Web site mapping. WWW7 Proceedings of the seventh international conference on World Wide Web 7
Focused crawling: a new approach to topic-specific Web resource discovery. WWW '99 Proceedings of the eighth international conference on World Wide Web
Evaluating topic-driven web crawlers. Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Modern Information Retrieval
Automating the Construction of Internet Portals with Machine Learning. Information Retrieval
Focused Crawling Using Context Graphs. VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Panorama: extending digital libraries with topical crawlers. Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
Automatic generation of agents for collecting hidden web pages for data extraction. Data & Knowledge Engineering - Special issue: WIDM 2002
Topical web crawlers: Evaluating adaptive algorithms. ACM Transactions on Internet Technology (TOIT)
A General Evaluation Framework for Topical Crawlers. Information Retrieval
Learning to crawl: Comparing classification schemes. ACM Transactions on Information Systems (TOIS)
Link Contexts in Classifier-Guided Topical Crawlers. IEEE Transactions on Knowledge and Data Engineering
Using HMM to learn user browsing patterns for focused web crawling. Data & Knowledge Engineering - Special issue: WIDM 2004

Cited (8 works that cite this paper):
The impact of term selection in genre-aware focused crawling. Proceedings of the 2008 ACM symposium on Applied computing
Development of a National Syllabus Repository for Higher Education in Ireland. ECDL '08 Proceedings of the 12th European conference on Research and Advanced Technology for Digital Libraries
A Genre-Aware Approach to Focused Crawling. World Wide Web
Focused browsing: providing topical feedback for link selection in hypertext browsing. ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval
Kairos: proactive harvesting of research paper metadata from scientific conference web sites. ICADL'10 Proceedings of the role of digital libraries in a time of global change, and 12th international conference on Asia-Pacific digital libraries
A conceptual framework for efficient web crawling in virtual integration contexts. WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
A tool for link-based web page classification. CAEPIA'11 Proceedings of the 14th international conference on Advances in artificial intelligence: spanish association for artificial intelligence
Intelligent web navigation. FDIA'09 Proceedings of the Third BCS-IRSG conference on Future Directions in Information Access

Abstract

In this paper, we propose a novel approach to focused crawling that exploits genre and content-related information present in Web pages to guide the crawling process. We demonstrate the effectiveness, efficiency, and scalability of this approach through a set of experiments on crawling pages related to syllabi (genre) of computer science courses (content). The results show that focused crawlers built according to our approach achieve F1 levels above 92% (an average gain of 178% over traditional focused crawlers) and need to analyze no more than 60% of the visited pages to find 90% of the relevant pages (an average gain of 82% over traditional focused crawlers).
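
For readers unfamiliar with the measures quoted in the abstract, the following is a minimal sketch of how an F1 value and a relative gain over a baseline are usually computed. The function names and the page counts in the usage lines are hypothetical, chosen only for illustration; they are not taken from the paper, and the paper's own evaluation procedure may differ in detail.

```python
def f1_score(relevant_retrieved: int, retrieved: int, relevant: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    relevant_retrieved: relevant pages the crawler actually collected
    retrieved:          all pages the crawler collected
    relevant:           all relevant pages that exist in the test collection
    """
    if retrieved == 0 or relevant == 0:
        return 0.0
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / relevant
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_gain(new: float, baseline: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline


# Hypothetical counts, for illustration only (not results from the paper).
crawler_f1 = f1_score(relevant_retrieved=90, retrieved=100, relevant=100)
baseline_f1 = f1_score(relevant_retrieved=30, retrieved=100, relevant=100)
print(f"F1 = {crawler_f1:.2f}, gain = {relative_gain(crawler_f1, baseline_f1):.0f}%")
```

Under this definition, a gain above 100% means the new value is more than double the baseline.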