Accurate and efficient crawling for relevant websites

Authors:
Martin Ester;Hans-Peter Kriegel;Matthias Schubert
Affiliations:
Simon Fraser University, School of Computing Science, Burnaby BC, Canada;University of Munich, Institute for Computer Science, Munich, Germany;University of Munich, Institute for Computer Science, Munich, Germany
Venue:
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Year:
2004

Abstract

Focused web crawlers have recently emerged as an alternative to the well-established web search engines. While the well-known focused crawlers retrieve relevant webpages, various applications target whole websites instead of single webpages. For example, companies are represented by websites, not by individual webpages. To answer queries targeted at websites, web directories are an established solution. In this paper, we introduce a novel focused website crawler that applies the paradigm of focused crawling to the search for relevant websites. The proposed crawler is based on a two-level architecture and corresponding crawl strategies with an explicit concept of websites. The external crawler views the web as a graph of linked websites, selects the websites to be examined next, and invokes internal crawlers. Each internal crawler views the webpages of a single given website and performs focused (page) crawling within that website. Our experimental evaluation demonstrates that the proposed focused website crawler clearly outperforms previous methods of focused crawling that were adapted to retrieve websites instead of single webpages.
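
The two-level architecture described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the in-memory `TOY_WEB`, the keyword-based `page_score`, the averaging of page scores into a site score, the `threshold`, and the page `budget` are all hypothetical stand-ins chosen for illustration, since the abstract does not specify the classifiers or crawl strategies used.

```python
import heapq
from urllib.parse import urlparse

# Hypothetical toy web: URL -> (page text, outgoing links). Not from the paper.
TOY_WEB = {
    "http://a.example/": ("bike shop home", ["http://a.example/p", "http://b.example/"]),
    "http://a.example/p": ("bike parts and bike repair", ["http://c.example/"]),
    "http://b.example/": ("cooking recipes", ["http://b.example/x"]),
    "http://b.example/x": ("more recipes", []),
    "http://c.example/": ("bike tours", []),
}
TOPIC = {"bike"}  # assumed topic vocabulary


def site_of(url):
    """Treat the host name as the website identifier (an assumption)."""
    return urlparse(url).netloc


def page_score(text):
    """Stand-in relevance score: fraction of words that are topic words."""
    words = text.split()
    return sum(w in TOPIC for w in words) / len(words) if words else 0.0


def internal_crawl(entry_url, budget=10):
    """Focused page crawl restricted to a single website.

    Returns (mean page score, set of links leading to other websites).
    """
    site = site_of(entry_url)
    frontier = [(0.0, entry_url)]  # min-heap keyed on negated priority
    seen = {entry_url}
    scores, external = [], set()
    while frontier and len(scores) < budget:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB.get(url, ("", []))
        score = page_score(text)
        scores.append(score)
        for link in links:
            if site_of(link) != site:
                external.add(link)  # handed back to the external crawler
            elif link not in seen:
                seen.add(link)
                # Pages linked from relevant pages are visited first.
                heapq.heappush(frontier, (-score, link))
    return (sum(scores) / len(scores) if scores else 0.0), external


def external_crawl(seed_urls, threshold=0.2, max_sites=10):
    """Crawl the graph of websites, invoking one internal crawl per site."""
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    visited, relevant = set(), []
    while frontier and len(visited) < max_sites:
        _, entry = heapq.heappop(frontier)
        site = site_of(entry)
        if site in visited:
            continue
        visited.add(site)
        score, external = internal_crawl(entry)
        if score >= threshold:
            relevant.append(site)
        for link in external:
            if site_of(link) not in visited:
                # Sites linked from relevant sites are examined first.
                heapq.heappush(frontier, (-score, link))
    return relevant
```

On this toy data, `external_crawl(["http://a.example/"])` returns `["a.example", "c.example"]`: the off-topic site `b.example` is examined but rejected. Keeping the page-level frontier inside `internal_crawl` and the site-level frontier inside `external_crawl` mirrors the separation between the two crawler levels that the abstract describes.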