This article introduces an architecture for a document-partitioned search engine, based on a novel approach that combines collection selection and load balancing, called load-driven routing. By exploiting the query-vector document model and the incremental caching technique, our architecture can compute high-quality results for any query with only a fraction of the computational load used in a typical document-partitioned architecture. By giving up a small fraction of the results, our technique strongly reduces the computing pressure on a search engine back-end: we retrieve more than 2/3 of the top-5 results for a given query with only 10% of the computing load needed by a configuration in which the query is processed by every index partition. Alternatively, we can increase the load to 25% to improve precision and retrieve more than 80% of the top-5 results. The flexibility of our system allows a wide range of configurations, so that it can easily respond to different requirements in result quality or to restrictions in computing power. More importantly, the system configuration can be adjusted dynamically to cope with unexpected query peaks or unpredictable failures. This article summarizes recent work by the authors and reports results from tests conducted on 6 million documents and 2,800,000 queries, with real query cost timings measured on an actual index.
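The abstract does not spell out the routing rule, so the following is only a minimal sketch of how collection selection and load balancing might be combined in a single routing decision. It assumes that each index partition reports its current load as a fraction of capacity, that a collection-selection function has already scored the partitions for the query, and that the best-ranked partition always answers while the others answer only when their load is under a cap. The names `Partition`, `route_query`, and `load_cap` are hypothetical and do not come from the article.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    load: float  # current load as a fraction of capacity (hypothetical measure)


def route_query(scores, partitions, load_cap):
    """Return the names of the partitions that should process one query.

    scores: partition name -> collection-selection score (higher is better).
    partitions: list of Partition objects with their current load.
    load_cap: load at or above which a partition skips non-top-ranked queries.
    """
    # Rank partitions from most to least promising for this query.
    ranked = sorted(partitions, key=lambda p: scores.get(p.name, 0.0), reverse=True)
    chosen = []
    for rank, part in enumerate(ranked):
        # Assumption: the best-ranked partition always answers, so every query
        # gets some results; the remaining ones answer only if lightly loaded.
        if rank == 0 or part.load < load_cap:
            chosen.append(part.name)
    return chosen


partitions = [Partition("p0", 0.9), Partition("p1", 0.2), Partition("p2", 0.7)]
scores = {"p0": 0.8, "p1": 0.5, "p2": 0.1}
selected = route_query(scores, partitions, load_cap=0.5)
```

In this sketch, lowering `load_cap` sends each query to fewer partitions and so trades result coverage for back-end load, which is the kind of adjustable trade-off the abstract describes. The actual thresholds and selection function used by the authors are not given here.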