Improving the performance of focused web crawlers

  • Authors:
  • Sotiris Batsakis; Euripides G. M. Petrakis; Evangelos Milios

  • Affiliations:
  • Department of Electronic and Computer Engineering, Technical University of Crete (TUC), Chania, Crete GR-73100, Greece; Department of Electronic and Computer Engineering, Technical University of Crete (TUC), Chania, Crete GR-73100, Greece; Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada B3H 1W5

  • Venue:
  • Data & Knowledge Engineering
  • Year:
  • 2009

Abstract

This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers that rely on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also the paths leading to relevant pages. A novel learning crawler inspired by a previously proposed Hidden Markov Model (HMM) crawler is also described. All crawlers share the same baseline implementation (only the priority assignment function differs in each crawler), providing an unbiased framework for a comparative evaluation of their performance. All crawlers achieve their maximum performance when a combination of web page content and (link) anchor text is used for assigning download priorities to web pages. Furthermore, the new HMM crawler improves upon the original HMM crawler and outperforms classic focused crawlers in searching for specialized topics.
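
The shared-baseline design described above can be illustrated with a minimal sketch: a best-first crawler whose behaviour is determined entirely by a pluggable priority function that combines page-content and anchor-text relevance. This is not the authors' code; the `fetch_page` helper, the 0.5/0.5 weighting, and the 0.2 relevance threshold are illustrative assumptions.

```python
# Minimal best-first focused-crawler sketch (assumed design, not the paper's code).
import heapq
import itertools
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case bag-of-words representation of a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


def priority(topic_vec, page_text, anchor_text, w_content=0.5, w_anchor=0.5):
    """Download priority combining page-content and anchor-text relevance.
    The linear 0.5/0.5 weighting is an assumption for illustration."""
    return (w_content * cosine(topic_vec, tokenize(page_text))
            + w_anchor * cosine(topic_vec, tokenize(anchor_text)))


def crawl(seeds, topic_keywords, fetch_page, max_pages=100):
    """Best-first crawl: always expand the URL with the highest priority.

    `fetch_page(url)` is a hypothetical helper expected to return
    (page_text, [(out_url, anchor_text), ...]).
    """
    topic_vec = tokenize(" ".join(topic_keywords))
    tie = itertools.count()                       # tie-breaker for equal priorities
    frontier = [(-1.0, next(tie), url, "") for url in seeds]
    heapq.heapify(frontier)
    visited, relevant = set(), []

    while frontier and len(visited) < max_pages:
        _, _, url, _ = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page_text, out_links = fetch_page(url)
        if cosine(topic_vec, tokenize(page_text)) > 0.2:   # relevance threshold (assumed)
            relevant.append(url)
        for out_url, anchor in out_links:
            if out_url not in visited:
                score = priority(topic_vec, page_text, anchor)
                heapq.heappush(frontier, (-score, next(tie), out_url, anchor))
    return relevant
```

In this sketch, the crawler variants compared in the paper would differ only in the `priority` function; a path-learning (HMM-style) variant would replace it with a function that also scores the sequence of pages leading to the candidate URL.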