Finding potential seeds through rank aggregation of web searches

Authors:
Rajendra Prasath;Pinar Oztürk
Affiliations:
Department of Computer and Information Science (IDI), Norwegian University of Science and Technology (NTNU), Trondheim, Norway;Department of Computer and Information Science (IDI), Norwegian University of Science and Technology (NTNU), Trondheim, Norway
Venue:
PReMI'11 Proceedings of the 4th international conference on Pattern recognition and machine intelligence
Year:
2011

Abstract

This paper presents a potential seed selection algorithm for web crawlers using a gain-share scoring approach. Initially, we consider a set of arbitrarily chosen tourism queries. Each query is submitted to N selected commercial Search Engines (SEs); the top m search results from each SE are obtained, and each of these m results is manually evaluated and assigned a relevance score. For each of the m results, a gain-share score is computed using its hyperlink structure across the N ranked lists. A gain score is computed for each link present in each of the m results, and a portion of that gain score is propagated to the share score of each of the m results. These updated share scores of the m results determine the potential set of seed URLs for web crawling. Experimental results on tourism-related web data illustrate the effectiveness of the proposed seed selection algorithm.
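
The abstract does not give the scoring formulas, so the following Python sketch only illustrates the general shape of a gain-share computation under stated assumptions; it is not the authors' method. The `Result` type, the `select_seeds` function, the damping factor `alpha`, and both score definitions (gain as the number of ranked lists in which a link appears, share as the manual relevance plus a fraction of the gains of a page's links) are hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Result:
    url: str
    relevance: float  # manually assigned relevance score
    links: list = field(default_factory=list)  # hyperlinks found on the page


def select_seeds(ranked_lists, alpha=0.5, k=2):
    """Return up to k candidate seed URLs from N ranked lists of Result objects."""
    # Gain score (assumed definition): for each hyperlink, the number of
    # ranked lists containing at least one result that links to it.
    gain = defaultdict(int)
    for results in ranked_lists:
        seen = set()
        for r in results:
            seen.update(r.links)
        for link in seen:
            gain[link] += 1

    # Share score (assumed definition): manual relevance plus a fraction
    # alpha of the gain of every distinct link on the page. A URL returned
    # by several engines keeps its highest share score.
    share = defaultdict(float)
    for results in ranked_lists:
        for r in results:
            propagated = alpha * sum(gain[link] for link in set(r.links))
            share[r.url] = max(share[r.url], r.relevance + propagated)

    # Highest share score first; ties are broken alphabetically by URL.
    return sorted(share, key=lambda u: (-share[u], u))[:k]


# Toy input: two engines (N = 2), two results each (m = 2).
se1 = [Result("a.example", 2.0, ["x.example", "y.example"]),
       Result("b.example", 3.0)]
se2 = [Result("c.example", 1.0, ["x.example"]),
       Result("a.example", 2.0, ["x.example", "y.example"])]
seeds = select_seeds([se1, se2], alpha=0.5, k=2)
```

In this toy input, a page with moderate relevance but links that recur across both ranked lists can outrank a more relevant page with no shared links, which is the kind of effect a propagated gain score is meant to capture. The actual weighting used in the paper may differ.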