Topic-specific crawling on the Web with the measurements of the relevancy context graph

Authors:
Ching-Chi Hsu;Fan Wu
Affiliations:
Institution for Information Industry, Taipei, 101, Taiwan;Department of Information Management, National Chung-Cheng University, Chia-Yi, 621, Taiwan
Venue:
Information Systems
Year:
2006

Citing 11
Cited 5

Inferring Web communities from link topology

Proceedings of the ninth ACM conference on Hypertext and hypermedia : links, objects, time and space---structure in hypermedia systems: links, objects, time and space---structure in hypermedia systems
Improved algorithms for topic distillation in a hyperlinked environment

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Efficient crawling through URL ordering

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Hypertext information retrieval for the Web

ACM SIGIR Forum
A re-examination of text categorization methods

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Focused crawling: a new approach to topic-specific Web resource discovery

WWW '99 Proceedings of the eighth international conference on World Wide Web
Authoritative sources in a hyperlinked environment

Journal of the ACM (JACM)
Topical locality in the Web

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Distributed Hypertext Resource Discovery Through Examples

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Focused Crawling Using Context Graphs

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases

A Topic-Specific Web Crawler with Concept Similarity Context Graph Based on FCA

ICIC '08 Proceedings of the 4th international conference on Intelligent Computing: Advanced Intelligent Computing Theories and Applications - with Aspects of Artificial Intelligence
Learnable focused crawling based on ontology

AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
Semantic ranking of web pages based on formal concept analysis

Journal of Systems and Software
Course-specific search engines: semi-automated methods for identifying high quality topic-specific corpora

Proceedings of the 22nd international conference on World Wide Web companion
Editorial: A topic-specific crawling strategy based on semantics similarity

Data & Knowledge Engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

One of the major problems for automatically constructed portals and information discovery systems is how to assign proper order to unvisited web pages. Topic-specific crawlers and information seeking agents should try not to traverse the off-topic areas and concentrate on links that lead to documents of interest. In this paper, we propose an effective approach based on the relevancy context graph to solve this problem. The graph can estimate the distance and the relevancy degree between the retrieved document and the given topic. By calculating the word distributions of the general and topic-specific feature words, our method will preserve the property of the relevancy context graph and reflect it on the word distributions. With the help of topic-specific and general word distribution, our crawler can measure a page's expected relevancy to a given topic and determine the order in which pages should be visited first. Simulations are also performed, and the results show that our method outperforms than the breath-first and the method using only the context graph.