This paper explores the problem of computing pairwise similarity on document collections, focusing on the application of "more like this" queries in the life sciences domain. Three MapReduce algorithms are introduced: one based on brute force, a second where the problem is treated as large-scale ad hoc retrieval, and a third based on the Cartesian product of postings lists. Each algorithm supports one or more approximations that trade effectiveness for efficiency, the characteristics of which are studied experimentally. Results show that the brute force algorithm is the most efficient of the three when exact similarity is desired. However, the other two algorithms support approximations that yield large efficiency gains without significant loss of effectiveness.
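As an illustration of the third approach, the following is a minimal single-machine sketch of pairwise similarity computed from the Cartesian product of each postings list, assuming inner-product similarity over term weights. It is not the paper's implementation: the function names, the toy documents, and the `max_df` cutoff (one possible approximation that skips very common terms) are all assumptions made for this example, and the map and reduce phases are simulated in plain Python rather than run on a cluster.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Invert {doc_id: {term: weight}} into {term: [(doc_id, weight), ...]}."""
    postings = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            postings[term].append((doc_id, weight))
    return postings


def pairwise_similarity(docs, max_df=None):
    """Sum per-term partial products over all document pairs sharing a term.

    max_df is a hypothetical knob: terms that occur in more than max_df
    documents are skipped, trading some accuracy for less work.
    """
    scores = defaultdict(float)
    for term, plist in build_postings(docs).items():
        if max_df is not None and len(plist) > max_df:
            continue  # approximation: ignore very common terms
        # "Map" step: each pair within a postings list emits a partial product.
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            # "Reduce" step: partial products for the same pair are summed.
            scores[(d1, d2)] += w1 * w2
    return dict(scores)


# Toy collection with made-up term weights.
docs = {
    "a": {"gene": 2.0, "protein": 1.0},
    "b": {"gene": 1.0, "cell": 3.0},
    "c": {"protein": 4.0, "cell": 1.0, "gene": 1.0},
}
exact = pairwise_similarity(docs)
approx = pairwise_similarity(docs, max_df=2)
```

With `max_df=2`, the term "gene" (present in all three documents) is dropped, so the pair ("a", "b") disappears from the output and the remaining scores shrink. This shows in miniature how such a cutoff trades effectiveness for efficiency: a postings list of length n generates n(n-1)/2 pairs, so the most common terms account for most of the work.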