Automatic resource compilation by analyzing hyperlink structure and associated text
WWW7 Proceedings of the seventh international conference on World Wide Web 7
The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7
The shark-search algorithm. An application: tailored Web site mapping
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Focused crawling: a new approach to topic-specific Web resource discovery
WWW '99 Proceedings of the eighth international conference on World Wide Web
Evaluating topic-driven web crawlers
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Effective site finding using link anchor information
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Accelerated focused crawling through online relevance feedback
Proceedings of the 11th international conference on World Wide Web
MySpiders: Evolve Your Own Intelligent Web Crawlers
Autonomous Agents and Multi-Agent Systems
Focused Crawling Using Context Graphs
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Deriving link-context from HTML tag tree
DMKD '03 Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
A framework to derive web page context from hyperlink structure
International Journal of Information and Communication Technology
Towards automatic assessment of government web sites
Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics
In designing a focused crawler, the strategy for prioritizing unvisited URLs is crucial. Intuitively, the text surrounding a link (the link context) on an HTML page is a good summary of the target page. However, little work has been done to exploit link-context information about the seed URLs before actual crawling begins. Motivated by these two observations, we propose a method that collects such resources beforehand and then uses them to guide the actual crawl. Experiments show that the proposed approach is reasonable and particularly effective for single-topic crawling, especially in the initial stage.
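The prioritization idea in the abstract can be sketched minimally: score each unvisited URL by how similar its link context is to a topic description, and crawl best-first. The sketch below is an illustration only, not the paper's method; the cosine-similarity scoring, the `Frontier` class, and all names are assumptions introduced here.

```python
import heapq
import math
from collections import Counter

def context_score(link_context, topic_terms):
    """Cosine similarity between a link's surrounding text and the topic vocabulary."""
    ctx = Counter(w.lower() for w in link_context.split())
    topic = Counter(w.lower() for w in topic_terms)
    dot = sum(ctx[w] * topic[w] for w in ctx)
    norm = (math.sqrt(sum(c * c for c in ctx.values()))
            * math.sqrt(sum(c * c for c in topic.values())))
    return dot / norm if norm else 0.0

class Frontier:
    """Priority queue of unvisited URLs, best link-context score first."""
    def __init__(self, topic_terms):
        self.topic = list(topic_terms)
        self.heap = []
        self.seen = set()

    def push(self, url, link_context):
        if url not in self.seen:
            self.seen.add(url)
            # heapq is a min-heap, so negate the score for best-first ordering
            heapq.heappush(self.heap, (-context_score(link_context, self.topic), url))

    def pop(self):
        """Return the unvisited URL whose link context best matches the topic."""
        return heapq.heappop(self.heap)[1]

f = Frontier(["web", "crawler", "focused"])
f.push("http://a.example", "a focused web crawler survey")
f.push("http://b.example", "cooking recipes and food")
best = f.pop()  # the on-topic link is dequeued first
```

A real focused crawler would replace the raw word counts with weighted term vectors (e.g. TF-IDF) and, as the abstract suggests, could seed the topic vocabulary from link contexts collected before crawling starts.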