Machine Learning
Introduction to Algorithms
Cumulated gain-based evaluation of IR techniques
ACM Transactions on Information Systems (TOIS)
A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
ACM Transactions on Information Systems (TOIS)
The impact of caching on search engines
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
On the feasibility of geographically distributed web crawling
Proceedings of the 3rd international conference on Scalable information systems
Improved techniques for result caching in web search engines
Proceedings of the 18th international conference on World wide web
Efficiency trade-offs in two-tier web search systems
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
On the feasibility of multi-site web search engines
Proceedings of the 18th ACM conference on Information and knowledge management
The WEKA data mining software: an update
ACM SIGKDD Explorations Newsletter
A refreshing perspective of search engine caching
Proceedings of the 19th international conference on World wide web
Caching search engine results over incremental indices
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Query forwarding in geographically distributed search engines
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Document assignment in multi-site search engines
Proceedings of the fourth ACM international conference on Web search and data mining
Cost-Aware Strategies for Query Result Caching in Web Search Engines
ACM Transactions on the Web (TWEB)
Timestamp-based result cache invalidation for web search engines
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Assigning documents to master sites in distributed search
Proceedings of the 20th ACM international conference on Information and knowledge management
On caching search engine query results
Computer Communications
Prefetching query results and its impact on search engines
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Online result cache invalidation for real-time web search
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Document replication strategies for geographically distributed web search engines
Information Processing and Management: an International Journal
Hi-index | 0.00 |
A multi-site web search engine is composed of a number of search sites geographically distributed around the world. Each search site is typically responsible for crawling and indexing the web pages that are in its geographical neighborhood. A query is selectively processed on a subset of search sites that are predicted to return the best-matching results. The scalability and efficiency of multi-site web search engines have attracted a lot of research attention in recent years. In particular, research has focused on replicating important web pages across sites, forwarding queries to relevant sites, and caching results of previous queries. Yet, these problems have only been studied in isolation, but no prior work has properly investigated the interplay between them. In this paper, we take this challenge up and conduct what we believe is the first comprehensive analysis of a full stack of techniques for efficient multi-site web search. Specifically, we propose a document replication technique that improves the query locality of the state-of-the-art approaches with various replication budget distribution strategies. We devise a machine learning approach to decide the query forwarding patterns, achieving a significantly lower false positive ratio than a state-of-the-art thresholding approach with little negative impact on search result quality. We propose three result caching strategies that reduce the number of forwarded queries and analyze the trade-off they introduce in terms of storage and network overheads. Finally, we show that the combination of the best-of-the-class techniques yields very promising search efficiency, rendering multi-site, geographically distributed web search engines an attractive alternative to centralized web search engines.