Focused crawling by exploiting anchor text using decision tree
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Topical web crawling technology is important for domain-specific resource discovery. Topical crawlers achieve good recall as well as good precision by restricting themselves to web pages in a specific domain. Intuitively, the text surrounding a link (the link-context) on an HTML page is a good summary of the target page. Motivated by this, the paper investigates several alternative methods and argues that link-context derived from the referring page's HTML tag tree provides rich guidance for keeping a crawler on a domain-specific topic. So that the crawler can acquire enough guidance from link-context, we first locate referring pages by traversing backward from the seed URLs, and then build an initial term-based feature set by parsing the link-contexts extracted from those referring pages. The feature set, which is used to measure the similarity of crawled pages' link-contexts, can be adaptively retrained during crawling with the link-contexts of pages found to be relevant. The paper also presents several relevance metrics and an evaluation function for ranking URLs by page relevance. A comprehensive experiment shows clearly that this approach outperforms both the Best-First and Breadth-First algorithms in harvest rate and efficiency.
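The pipeline the abstract describes — derive a link-context from the tag tree around each anchor, then rank the linked URLs by similarity to a term-based feature set — can be sketched roughly as follows. This is a minimal illustration, not the paper's actual method: it approximates tag-tree link-context as the text of the anchor's parent element's subtree, and uses plain cosine similarity over term counts; the class and function names (`LinkContextParser`, `tokenize`, `cosine`) are illustrative, and the sample HTML and feature terms are invented.

```python
# Hypothetical sketch: tag-tree link-context extraction + similarity ranking.
# Assumes well-formed HTML; "link-context" here is simplified to the text of
# the <a> element's parent subtree, which only approximates the paper's
# tag-tree derivation.
import math
import re
from collections import Counter
from html.parser import HTMLParser


def tokenize(text):
    """Lowercase word tokens (illustrative, no stemming/stopwords)."""
    return re.findall(r"[a-z]+", text.lower())


class LinkContextParser(HTMLParser):
    """Collect, for each <a href=...>, the text of its parent element."""

    def __init__(self):
        super().__init__()
        self.buffers = []      # one text buffer per currently open element
        self.open_links = []   # (href, index of the anchor's parent element)
        self.contexts = []     # finished (href, context_text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Parent is the element currently on top of the stack.
                self.open_links.append((href, len(self.buffers) - 1))
        self.buffers.append([])

    def handle_data(self, data):
        # Text belongs to every open element's subtree.
        for buf in self.buffers:
            buf.append(data)

    def handle_endtag(self, tag):
        if not self.buffers:
            return
        idx = len(self.buffers) - 1
        text = " ".join("".join(self.buffers.pop()).split())
        remaining = []
        for href, parent_idx in self.open_links:
            if parent_idx == idx:
                self.contexts.append((href, text))  # parent closed: done
            else:
                remaining.append((href, parent_idx))
        self.open_links = remaining


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Invented example page and feature terms, purely for illustration.
page = """<html><body>
<div>machine learning tutorials and <a href="/ml">course notes</a></div>
<div>cheap flights and <a href="/travel">hotel deals</a></div>
</body></html>"""

parser = LinkContextParser()
parser.feed(page)
feature_set = Counter(tokenize("machine learning course tutorial notes"))
scores = {href: cosine(Counter(tokenize(ctx)), feature_set)
          for href, ctx in parser.contexts}
# The on-topic link (/ml) should outrank the off-topic one (/travel).
```

In a real crawler the feature set would be bootstrapped from the backward-traversed referring pages and retrained online as relevant pages are found; here it is a fixed term list only to make the ranking step concrete.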