Pairwise document similarity in large collections with MapReduce

Authors:
Tamer Elsayed;Jimmy Lin;Douglas W. Oard
Affiliations:
University of Maryland, College Park, MD;University of Maryland, College Park, MD;University of Maryland, College Park, MD
Venue:
HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
Year:
2008

Abstract

This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to decompose the inner products involved in computing document similarity into separate multiplication and summation stages in a way that is well matched to efficient disk access patterns across several machines. On a collection consisting of approximately 900,000 newswire articles, our algorithm exhibits linear growth in running time and space in terms of the number of documents.
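
The decomposition described in the abstract can be illustrated with a small single-machine sketch. It assumes that similarity is the inner product of term-weight vectors, sim(d_i, d_j) = sum over shared terms t of w(t, d_i) * w(t, d_j), and that the weights are raw term frequencies; the abstract does not specify the weighting scheme. The function names (`build_postings`, `map_products`, `reduce_sums`) are hypothetical and chosen for illustration, and plain Python generators stand in for a real MapReduce runtime.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Indexing step: map each term to its (doc_id, weight) postings.

    Assumption: the weight is the raw term frequency.
    """
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            postings[term].append((doc_id, tf))
    return postings


def map_products(postings):
    """Multiplication stage: for each term, emit one partial product
    for every pair of documents that share that term."""
    for plist in postings.values():
        # Sorting gives each pair a canonical (smaller_id, larger_id) key.
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            yield (d1, d2), w1 * w2


def reduce_sums(partials):
    """Summation stage: add up the partial products for each document pair."""
    sims = defaultdict(int)
    for pair, value in partials:
        sims[pair] += value
    return dict(sims)


def pairwise_similarity(docs):
    """Chain the stages: index, multiply per term, sum per document pair."""
    return reduce_sums(map_products(build_postings(docs)))


if __name__ == "__main__":
    docs = {
        "a": "apple banana apple",
        "b": "apple cherry",
        "c": "banana cherry cherry",
    }
    for pair, score in sorted(pairwise_similarity(docs).items()):
        print(pair, score)
```

Only document pairs that share at least one term ever produce a key, so pairs with zero similarity cost nothing. In a distributed setting the multiplication stage would run as mappers over posting lists and the summation stage as reducers keyed by document pair.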