Automatic resource compilation by analyzing hyperlink structure and associated text
WWW7 Proceedings of the seventh international conference on World Wide Web 7
The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7
The shark-search algorithm. An application: tailored Web site mapping
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Focused crawling: a new approach to topic-specific Web resource discovery
WWW '99 Proceedings of the eighth international conference on World Wide Web
Evaluating topic-driven web crawlers
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Effective site finding using link anchor information
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Accelerated focused crawling through online relevance feedback
Proceedings of the 11th international conference on World Wide Web
MySpiders: Evolve Your Own Intelligent Web Crawlers
Autonomous Agents and Multi-Agent Systems
Focused Crawling Using Context Graphs
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Deriving link-context from HTML tag tree
DMKD '03 Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
A framework to derive web page context from hyperlink structure
International Journal of Information and Communication Technology
Towards automatic assessment of government web sites
Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics
In designing a focused crawler, the strategy for prioritizing unvisited URLs is crucial. Intuitively, the text surrounding a link (the link context) on an HTML page is a good summary of the target page. However, little work has been done to exploit link-context information about the seed URLs before actual crawling begins. Motivated by these two observations, we propose a method that collects such resources beforehand and then uses them to guide the actual crawl. Experiments show that the proposed approach is reasonable and particularly effective for single-topic crawling, especially in the initial stage.
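The prioritization idea in the abstract can be sketched minimally: score each unvisited URL by how similar its link context is to a topic description, and crawl best-first. The sketch below is an illustration only, not the paper's method; the cosine-similarity scoring, the `Frontier` class, and all names are assumptions introduced here.

```python
import heapq
import math
from collections import Counter

def context_score(link_context, topic_terms):
    """Cosine similarity between a link's surrounding text and the topic vocabulary."""
    ctx = Counter(w.lower() for w in link_context.split())
    topic = Counter(w.lower() for w in topic_terms)
    dot = sum(ctx[w] * topic[w] for w in ctx)
    norm = (math.sqrt(sum(c * c for c in ctx.values()))
            * math.sqrt(sum(c * c for c in topic.values())))
    return dot / norm if norm else 0.0

class Frontier:
    """Priority queue of unvisited URLs, best link-context score first."""
    def __init__(self, topic_terms):
        self.topic = list(topic_terms)
        self.heap = []
        self.seen = set()

    def push(self, url, link_context):
        if url not in self.seen:
            self.seen.add(url)
            # heapq is a min-heap, so negate the score for best-first ordering
            heapq.heappush(self.heap, (-context_score(link_context, self.topic), url))

    def pop(self):
        """Return the unvisited URL whose link context best matches the topic."""
        return heapq.heappop(self.heap)[1]

f = Frontier(["web", "crawler", "focused"])
f.push("http://a.example", "a focused web crawler survey")
f.push("http://b.example", "cooking recipes and food")
best = f.pop()  # the on-topic link is dequeued first
```

A real focused crawler would replace the raw word counts with weighted term vectors (e.g. TF-IDF) and, as the abstract suggests, could seed the topic vocabulary from link contexts collected before crawling starts.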