A cross-language focused crawling algorithm based on multiple relevance prediction strategies

Authors:
Zhumin Chen; Jun Ma; Jingsheng Lei; Bo Yuan; Li Lian; Ling Song
Affiliations:
School of Computer Science and Technology, Shandong University, Jinan 250061, China; School of Computer Science and Technology, Shandong University, Jinan 250061, China; College of Information Science and Technology, Hainan University, Haikou 570228, China; Department of Computer Science, University of Southern California, Los Angeles, CA 90088, USA; School of Computer Science and Technology, Shandong University, Jinan 250061, China; School of Computer Science and Technology, Shandong University, Jinan 250061, China and School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, China
Venue:
Computers & Mathematics with Applications
Year:
2009

Abstract

Focused crawling is increasingly seen as a way to address the scalability limitations of general-purpose search engines: the crawler traverses the Web and gathers only pages that are relevant to a specific topic. A key issue in the design of focused crawlers is how to predict, for a given topic, the relevance of the unvisited pages pointed to by candidate URLs in the crawling frontier. In this paper, we propose a novel approach to this problem based on multiple relevance prediction strategies. To support cross-language crawling, we first introduce a hierarchical taxonomy that describes topics in both English and Chinese. We then present a formal description of the relevance prediction process and discuss four strategies that use the page contents, anchor texts, URL addresses and link types of Web pages, respectively, to evaluate relevance more accurately. Among these, we propose a particular strategy that uses Chinese URL addresses to estimate the relevance of cross-language Web pages. Finally, we combine these strategies with Shark-Search, a classic focused crawling algorithm, to obtain a new focused crawling algorithm, FCMRPS (Focused Crawling based on Multiple Relevance Prediction Strategies). Experiments show that FCMRPS is more effective than the traditional Breadth-First, Best-First and Shark-Search algorithms in terms of precision and sum of information.
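
To make the idea of combining several relevance signals concrete, the following is a minimal sketch of a frontier that ranks unvisited URLs by a weighted sum of four scores (parent page content, anchor text, URL address, link type). It is not the FCMRPS algorithm itself: the abstract does not give the scoring formulas, so the keyword set `TOPIC_TERMS`, the weights in `WEIGHTS`, the term-overlap measure, the same-host link-type heuristic, and all class and function names are assumptions made purely for illustration.

```python
import heapq
from dataclasses import dataclass
from urllib.parse import urlsplit

# Hypothetical bilingual topic description (English and Chinese keywords).
TOPIC_TERMS = {"football", "soccer", "league", "足球", "联赛"}

# Hypothetical weights for the four signals; not taken from the paper.
WEIGHTS = {"content": 0.4, "anchor": 0.3, "url": 0.2, "link_type": 0.1}


def term_overlap(text: str) -> float:
    """Fraction of topic terms occurring in the text (a stand-in relevance measure)."""
    lowered = text.lower()
    hits = sum(1 for term in TOPIC_TERMS if term in lowered)
    return hits / len(TOPIC_TERMS)


@dataclass
class Link:
    url: str          # candidate (unvisited) URL
    anchor_text: str  # text of the hyperlink pointing to it
    parent_text: str  # content of the already-fetched page containing the link
    parent_url: str   # URL of that page


def link_type_score(link: Link) -> float:
    """Assumed heuristic: same-host links score higher than cross-host links."""
    same_host = urlsplit(link.url).netloc == urlsplit(link.parent_url).netloc
    return 1.0 if same_host else 0.5


def predict_relevance(link: Link) -> float:
    """Weighted combination of the four per-signal scores."""
    scores = {
        "content": term_overlap(link.parent_text),
        "anchor": term_overlap(link.anchor_text),
        "url": term_overlap(link.url),
        "link_type": link_type_score(link),
    }
    return sum(WEIGHTS[name] * value for name, value in scores.items())


class Frontier:
    """Priority queue of unvisited URLs, highest predicted relevance first."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, str]] = []
        self._seen: set[str] = set()

    def push(self, link: Link) -> None:
        if link.url in self._seen:
            return  # each URL is queued at most once
        self._seen.add(link.url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-predict_relevance(link), link.url))

    def pop(self) -> tuple[str, float]:
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self) -> int:
        return len(self._heap)
```

A crawler loop built on this sketch would repeatedly call `pop()`, fetch the returned URL, and `push()` every outgoing link of the fetched page. Shark-Search additionally propagates a decayed score from ancestor pages to their descendants; that inheritance step is omitted here.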