Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce

Authors:
Jimmy Lin
Affiliations:
University of Maryland, College Park, MD, USA
Venue:
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Year:
2009

Abstract

This paper explores the problem of computing pairwise similarity on document collections, focusing on the application of "more like this" queries in the life sciences domain. Three MapReduce algorithms are introduced: one based on brute force, a second where the problem is treated as large-scale ad hoc retrieval, and a third based on the Cartesian product of postings lists. Each algorithm supports one or more approximations that trade effectiveness for efficiency, the characteristics of which are studied experimentally. Results show that the brute force algorithm is the most efficient of the three when exact similarity is desired. However, the other two algorithms support approximations that yield large efficiency gains without significant loss of effectiveness.
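
To make the contrast between the first and third approaches concrete, the following is a minimal single-process sketch in Python, not the paper's implementation. It assumes documents are sparse term-weight vectors and that similarity is the inner product of those weights; the abstract specifies neither. The names `build_postings`, `pairwise_from_postings`, `brute_force`, and the `df_cut` parameter are hypothetical. The `df_cut` option only illustrates the kind of approximation that trades effectiveness for efficiency, and it is not claimed to be one of the approximations the paper evaluates. The MapReduce phases are simulated with in-memory dictionaries.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Invert documents into postings lists (simulates an indexing job).

    docs: {doc_id: {term: weight}}
    returns: {term: [(doc_id, weight), ...]}
    """
    postings = defaultdict(list)
    for doc_id, vector in docs.items():
        for term, weight in vector.items():
            postings[term].append((doc_id, weight))
    return postings


def pairwise_from_postings(postings, df_cut=None):
    """Sum partial scores over the pairs within each postings list.

    "Map": each postings list emits one partial product per document pair.
    "Reduce": partial products are summed per pair.
    df_cut (hypothetical): skip terms that occur in more than df_cut
    documents, which shrinks the quadratic blow-up of long postings lists
    at the cost of an approximate score.
    """
    scores = defaultdict(float)
    for term, plist in postings.items():
        if df_cut is not None and len(plist) > df_cut:
            continue
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            scores[(d1, d2)] += w1 * w2
    return dict(scores)


def brute_force(docs):
    """Compare every document pair directly, without an index."""
    scores = {}
    for d1, d2 in combinations(sorted(docs), 2):
        score = sum(w * docs[d2].get(t, 0.0) for t, w in docs[d1].items())
        if score > 0:
            scores[(d1, d2)] = score
    return scores
```

With `df_cut=None` the two functions return the same scores for every pair that shares at least one term. Setting `df_cut` drops the contribution of frequent terms, so the indexed variant does less work but may under-score or miss pairs.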